Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
Abstract
Post-training quantization is the leading method for addressing memory-related bottlenecks in LLM inference, but unfortunately, it suffers from significant performance degradation below 4-bit precision. An alternative approach involves training compressed models directly at a low bitwidth (e.g., binary or ternary models). However, the performance, training dynamics, and scaling trends of such models are not yet well understood. To address this issue, we train and openly release the Spectra LLM suite consisting of 54 language models ranging from 99M to 3.9B parameters, trained on 300B tokens. Spectra includes FloatLMs, post-training quantized QuantLMs (3, 4, 6, and 8 bits), and ternary LLMs (TriLMs) - our improved architecture for ternary language modeling, which significantly outperforms previously proposed ternary models of a given size (in bits), matching half-precision models at scale. For example, TriLM 3.9B is (bit-wise) smaller than the half-precision FloatLM 830M, but matches half-precision FloatLM 3.9B in commonsense reasoning and knowledge benchmarks. However, TriLM 3.9B is also as toxic and stereotyping as FloatLM 3.9B, a model six times larger in size. Additionally, TriLM 3.9B lags behind FloatLM in perplexity on validation splits and web-based corpora but performs better on less noisy datasets like Lambada and PennTreeBank. To enhance understanding of low-bitwidth models, we are releasing 500+ intermediate checkpoints of the Spectra suite at https://github.com/NolanoOrg/SpectraSuite{https://github.com/NolanoOrg/SpectraSuite}.
Community
Repo: https://github.com/NolanoOrg/SpectraSuite
Models:
- TriLM Unpacked: https://huggingface.co/collections/SpectraSuite/trilms-unpacked-668d5f62afe0f4036925b1d2
- FloatLM: https://huggingface.co/collections/SpectraSuite/floatlms-668d603d5a19e1af67564812
Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1e61odl/introducing_spectra_a_comprehensive_study_of/
Hi @Ayushk44 congrats on this work!
I see that models and collections are available, would be able to link it to this paper?
Here's how to do that: https://huggingface.co/docs/hub/en/model-cards#linking-a-paper.
Moreover, a paper itself can also be added to a collection :)
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- QQQ: Quality Quattuor-Bit Quantization for Large Language Models (2024)
- Exploring Quantization for Efficient Pre-Training of Transformer Language Models (2024)
- LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices (2024)
- Scalable MatMul-free Language Modeling (2024)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper