arxiv:2409.17066

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Published on Sep 25, 2024
· Submitted by yangwang92 on Sep 30, 2024

Abstract

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, and 11-22% on LLaMA-3 on QA tasks. We only utilize 10.4-18.6% of the quantization algorithm execution time, resulting in a 1.6-1.8× increase in inference throughput compared to SOTA.
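To make "compressing vectors into indices using lookup tables" concrete, here is a minimal sketch of plain vector quantization on a weight matrix. It is not the VPTQ algorithm: it uses ordinary k-means with Euclidean distance, whereas the abstract describes second-order optimization, a dedicated codebook initialization, and residual and outlier handling. All function names and parameters (`vq_quantize`, `vector_len`, `k`) are hypothetical and chosen for illustration only.

```python
import math
import numpy as np

def assign(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry."""
    # Squared Euclidean distance from every vector to every centroid.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Build a codebook with plain k-means (an assumption, not VPTQ's method)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), size=k, replace=False)
    codebook = vectors[start].copy()
    for _ in range(iters):
        idx = assign(vectors, codebook)
        for j in range(k):
            members = vectors[idx == j]
            if len(members):  # leave a centroid unchanged if its cluster is empty
                codebook[j] = members.mean(axis=0)
    return codebook

def vq_quantize(weight, vector_len, k):
    """Split a weight matrix into short vectors and store one index per vector."""
    assert weight.size % vector_len == 0
    vectors = weight.reshape(-1, vector_len)
    codebook = kmeans_codebook(vectors, k)
    indices = assign(vectors, codebook)
    return codebook, indices

def vq_dequantize(codebook, indices, shape):
    """Rebuild an approximate weight matrix by table lookup."""
    return codebook[indices].reshape(shape)

def index_bits_per_weight(vector_len, k):
    """Index storage cost per weight, ignoring the codebook itself."""
    return math.log2(k) / vector_len

rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 64)).astype(np.float32)
codebook, indices = vq_quantize(weight, vector_len=4, k=16)
approx = vq_dequantize(codebook, indices, weight.shape)
bits = index_bits_per_weight(vector_len=4, k=16)
mse = float(((weight - approx) ** 2).mean())
```

In this toy setting each group of 4 weights is stored as one 4-bit index, so the index cost is 1 bit per weight; the codebook adds a fixed overhead that this sketch does not count.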

Community

Paper author Paper submitter
edited Oct 6, 2024

VPTQ (Vector Post-Training Quantization) is an advanced compression technique that dramatically reduces the size of large language models such as the 70B and 405B Llama models. VPTQ efficiently compresses these models to 1-2 bits within just a few hours, enabling them to run effectively on GPUs with limited memory.

Llama 3.1 70b chat on RTX4090 (24G @ 2bit)


Llama 3.1 70b prompt on RTX4090 (24G @ 2bit)

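A rough back-of-the-envelope check of why a 70B model at 2 bits can fit on a 24 GB card is sketched below. It counts weight storage only and assumes nominal parameter counts; codebooks, activations, the KV cache, and any layers kept at higher precision are ignored, so real footprints will be larger. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Nominal parameter counts (assumed round numbers).
llama_70b = 70_000_000_000
llama_405b = 405_000_000_000

fp16_70b = weight_memory_gb(llama_70b, 16)    # 140.0 GB
two_bit_70b = weight_memory_gb(llama_70b, 2)  # 17.5 GB, under a 24 GB budget
two_bit_405b = weight_memory_gb(llama_405b, 2)  # 101.25 GB
```

Under these assumptions, 2-bit weights for the 70B model take about 17.5 GB, which leaves a few gigabytes of a 24 GB GPU for everything else.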

In the tables, for example Table 2, you have highlighted the best values where VPTQ beats other quantization methods, but you did not highlight the best values where other methods were better. It would be a lot better to highlight the best values everywhere instead of giving VPTQ preferential treatment by highlighting them only when they come from your method :)

Also, just a small thing on the side for clarity: changing unit descriptions from something like mem/GB, cost/h to mem (GB), cost (h) might help a bit with understandability. I was confused at first by mem/GB because I thought it meant "memory per gigabyte".

There are also some other text issues, like the duplicate sentence at the top of page 3: "Under the guidance of the optimization problem, Under the guidance of the optimization problem".

Content-wise, though, this looks like super great work!

Paper author

Thanks for your suggestion. Our paper reviewers also pointed out the highlighting and the typos in the tables, and we will fix them in our camera-ready version. :-)

The current tech report is an early version that introduces our methods and early results. Thanks for your kind suggestion!


Models citing this paper 66



Spaces citing this paper 4

Collections including this paper 7