Accelerating Language Model Inference with Mixture of Attentions
Large Language Models (LLMs) are revolutionizing the world, but their immense computational demands have made them expensive and slow to deploy. Speculative decoding offers a promising solution: a smaller model drafts future tokens cheaply, and the larger LLM verifies them. However, speculative decoding techniques face challenges like partial observability and inefficient training. In this post, we explore a novel approach called Mixture of Attentions, which significantly enhances speculative decoding by addressing these limitations.
Why Read This Post?
In short: state-of-the-art inference results! The Mixture of Attentions architecture is designed to optimize speculative decoding in both single-device and client-server settings. Its improvements in observability, training efficiency, and flexibility make it more robust and adaptable to a wide range of use cases. Specifically, the combination of Layer Self-Attention (LSA), Cross-Attention (CA), and Target Layer Inference (TLI) results in faster and more accurate token generation, with the following key benefits:
- 9.5% Faster Decoding: Compared to the previous state-of-the-art speculative decoding method (EAGLE-2), Mixture of Attentions achieves a 9.5% increase in decoding speed.
- 25% Higher Acceptance Rate: By improving the smaller model’s understanding of the large model’s state and training it on-policy, Mixture of Attentions increases the number of tokens the large model accepts by 25%.
- Adaptability to Client-Server Scenarios: The architecture is especially effective in client-server deployments, where the smaller model can run on a client device and continue generating tokens even if the server (where the larger model is hosted) becomes unavailable.
- We have also shared the checkpoint for you to use. You can find the model here: https://huggingface.co/huawei-noah/MOASpec-Llama-3-8B-Instruct
- We have also implemented the approach in vLLM. You can find the code here: https://github.com/huawei-noah/HEBO/tree/mixture-of-attentions/
Understanding Speculative Decoding
As LLMs grow in size, so do their computational demands. These models' auto-regressive nature, where each new token is generated based on the previous sequence, makes them particularly slow and expensive to deploy in real-time applications. Speculative decoding offers an innovative solution to this problem by introducing a smaller model that can “draft” tokens, which the larger model then verifies. This method helps reduce the burden on the large model, speeding up the overall token generation process.
At its core, speculative decoding is a two-step process involving a smaller, efficient model (often called the draft model) and a larger, more powerful model (the verification model). The draft model generates a sequence of tokens, which are speculative guesses about what the larger model would generate. These tokens are then sent to the larger model for verification, which can accept or reject them.
- Drafting: The smaller model proposes future tokens based on the provided context.
- Verification: The larger model checks the drafted tokens for correctness. Tokens are accepted up to the first one that doesn’t match the large model’s prediction; that token and everything after it are discarded, and drafting restarts from the last accepted position.
This process of drafting and verification continues until the model reaches the desired output length. The essential advantage is that, in many cases, the smaller model predicts several tokens correctly, so the larger model needs fewer forward passes.
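To make the draft-and-verify loop concrete, here is a minimal sketch in Python. It assumes greedy decoding and the common variant in which the longest matching prefix of the draft is kept. The two "models" are toy next-token functions invented for this example, and all function names are hypothetical; a real system would verify every drafted position in a single forward pass of the large model.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, num_draft: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # Drafting: the small model proposes num_draft tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(num_draft):
        drafted.append(draft_next(context + drafted))

    # Verification: compare each draft with what the large model would emit.
    # (Called per position here for clarity only.)
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the large model's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft matched, so the large model contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt: List[Token], draft_next: NextTokenFn,
             target_next: NextTokenFn, max_new_tokens: int,
             num_draft: int = 4) -> Tuple[List[Token], int]:
    """Generate tokens and count how many verification rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, num_draft))
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


def toy_target_next(ctx: List[Token]) -> Token:
    """Toy large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft_next(ctx: List[Token]) -> Token:
    """Toy small model: agrees with the target except right after a 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft_next, toy_target_next, max_new_tokens=8)
    print(tokens, rounds)
```

With these toy models the output matches what the target alone would produce, but it takes 3 verification rounds instead of 8 single-token steps.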
Current Challenges in Speculative Decoding
Despite its promise, speculative decoding has its challenges. Two of the most pressing issues are partial observability and off-policy training.
- Partial Observability: In traditional speculative decoding approaches, the smaller model doesn’t have access to all the information the larger model uses. Specifically, it lacks access to the entire state of the large model, including key activations and hidden states from deeper layers. This can lead to suboptimal token predictions because the smaller model operates with only a partial view of the system. The result is more frequent mismatches during verification, leading to discarded sequences and inefficiency.
- Off-Policy Training: Training the smaller model is another area where current methods fall short. The smaller model is typically trained under ideal conditions, assuming it receives perfect inputs from the larger model. However, in real-world usage, the smaller model will often have to generate its own predictions, which may not always be accurate. This mismatch between training and inference is known as off-policy training, and it can cause a significant drop in performance when the model is deployed. The longer the smaller model drafts tokens on its own, the more likely it is to drift from the correct sequence, leading to increased errors.
Traditional Methods vs. Mixture of Attentions
Existing speculative decoding models, such as EAGLE and MEDUSA, attempt to address these challenges but have limitations. For example, EAGLE leverages activations from the large model to guide the smaller model’s predictions but still suffers from partial observability and struggles with off-policy training.
In the next section, we will introduce the Mixture of Attentions architecture, which provides a more grounded solution to these problems. By leveraging multiple attention mechanisms, the Mixture of Attentions approach enhances the smaller model’s ability to draft tokens accurately while training it in a more realistic, on-policy setting.
Introducing the Mixture of Attentions Architecture
To overcome the challenges of speculative decoding, the paper introduces a novel architecture called Mixture of Attentions, which brings three major innovations to speculative decoding: Layer Self-Attention (LSA), Cross-Attention (CA), and Target Layer Inference (TLI). Together, these components address the issues of partial observability and off-policy training while also offering a more flexible approach to balancing speed and accuracy.
Critical Components of the Mixture of Attentions Architecture:
Layer Self-Attention (LSA): One of the most significant limitations in traditional speculative decoding is that the smaller model doesn’t have complete information about the internal state of the larger model. The smaller model can only observe the final layer’s activations of the large model, which leads to partial observability — an incomplete understanding of the context.
The Mixture of Attentions architecture introduces Layer Self-Attention (LSA) to address this. This attention mechanism aggregates key activations from all layers of the large model rather than just the final layer. By summarizing information across multiple layers, LSA provides the smaller model with a much richer understanding of the current state, allowing it to make more informed token predictions.
How it Works:
- The larger model produces activations for every layer during token generation.
- LSA applies attention to these activations, extracting relevant information from every layer and reducing the dimensionality to make it manageable for the smaller model.
- This enhanced view reduces the likelihood of incorrect token drafts, improving the overall efficiency of the decoding process.
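The idea of attending over layers can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the learned query, key, and value projections are omitted (identity projections), the vectors are tiny, and mean pooling over layers is assumed as the reduction step. The function names are hypothetical and are not taken from the released code.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def layer_self_attention(layer_acts: List[Vector]) -> Vector:
    """Summarise one token's per-layer activations into a single vector.

    layer_acts[i] is the large model's activation for this token at layer i.
    Attention runs over the layer axis, not over the token axis.
    """
    dim = len(layer_acts[0])
    scale = math.sqrt(dim)
    attended: List[Vector] = []
    for query in layer_acts:
        weights = softmax([dot(query, key) / scale for key in layer_acts])
        attended.append([
            sum(w * value[i] for w, value in zip(weights, layer_acts))
            for i in range(dim)
        ])
    # Assumed reduction: average over layers so the draft model
    # receives one vector per token instead of one per layer.
    num_layers = len(attended)
    return [sum(vec[i] for vec in attended) / num_layers for i in range(dim)]


if __name__ == "__main__":
    # Three hypothetical layers, two-dimensional activations.
    print(layer_self_attention([[0.1, 0.3], [0.5, -0.2], [0.9, 0.4]]))
```

The point of the sketch is the shape change: many per-layer vectors go in, one summary vector per token comes out.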
Cross-Attention (CA): Another key limitation in traditional speculative decoding is the lack of on-policyness during training. In speculative decoding, the smaller model often needs to generate tokens based on its previous outputs, not just the perfect outputs from the larger model. However, most draft models are trained off-policy, meaning they are trained under ideal conditions where perfect inputs are assumed. This discrepancy between training and real-world usage significantly degrades performance when the model is deployed.
CA solves this by allowing the smaller model to learn in a more realistic, on-policy setting. The CA mechanism enables the smaller model to predict multiple future tokens at once while relying on activations from the larger model only up to the current token. By simulating real-world conditions during training, the smaller model becomes better equipped to handle errors and uncertainties during actual inference.
How it Works:
- During the drafting phase, the smaller model uses CA to predict a sequence of tokens (instead of one token at a time).
- The cross-attention layer uses activations from the larger model up to the current token but allows the smaller model to generate multiple tokens without needing constant feedback from the large model.
- This makes the smaller model T-step bounded, meaning it can draft up to T future tokens in a single pass, reducing the computational cost and making training more efficient.
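The key constraint above is that draft positions may only look at large-model activations up to the current token. A minimal sketch of that masking, again with identity projections and hypothetical names (the real layer has learned weights and works on batched tensors):

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], memory: List[Vector]) -> List[Vector]:
    """Each query attends over every memory vector (keys and values coincide)."""
    dim = len(memory[0])
    scale = math.sqrt(dim)
    outputs: List[Vector] = []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, memory))
            for i in range(dim)
        ])
    return outputs


def draft_context(large_acts: List[Vector], current_pos: int,
                  draft_queries: List[Vector]) -> List[Vector]:
    """Build context for T draft positions from large-model activations.

    Only activations at positions <= current_pos are visible, so all
    T = len(draft_queries) future tokens are drafted without new feedback
    from the large model.
    """
    visible = large_acts[: current_pos + 1]
    return cross_attention(draft_queries, visible)


if __name__ == "__main__":
    acts = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]   # position 2 lies in the future
    queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 draft positions
    print(draft_context(acts, current_pos=1, draft_queries=queries))
```

Because the memory is cut off at the current token, the same code path can be used in training and in inference, which is what makes the training setup closer to on-policy.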
Target Layer Inference (TLI): Traditional speculative decoding assumes that the smaller model should always predict the activations of the large model's final layer. The authors challenge this assumption with Target Layer Inference (TLI), which allows the smaller model to target a layer of the larger model other than the final one.
The intuition here is that predicting intermediate layers can be easier than predicting the final layer’s output, which may still lead to accurate token predictions. By targeting different layers, TLI enables a trade-off between speed and accuracy: targeting earlier layers is faster but might result in less accurate predictions, while targeting later layers increases accuracy but requires more computation.
How it Works:
- The architecture introduces a hyperparameter N defining the smaller model's target layer.
- If N = 0, the smaller model targets the final layer (standard approach). If N > 0, it targets earlier layers.
- This flexibility allows the model to adjust its behavior depending on the task's requirements, balancing speed and accuracy.
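As a small illustration of the hyperparameter, the sketch below picks the training target for the draft model from a list of per-layer activations. It assumes N counts layers back from the final one, which is our reading of the description above; the function name and data layout are hypothetical.

```python
from typing import List

Vector = List[float]


def select_target_activation(layer_acts: List[Vector], n: int) -> Vector:
    """Return the activation the draft model should learn to predict.

    layer_acts is ordered from the first layer to the final layer.
    n = 0 selects the final layer (the standard approach);
    n > 0 selects the layer n steps before the final one (assumption).
    """
    if not 0 <= n < len(layer_acts):
        raise ValueError(f"n must be in [0, {len(layer_acts) - 1}], got {n}")
    return layer_acts[len(layer_acts) - 1 - n]


if __name__ == "__main__":
    acts = [[0.1], [0.2], [0.3], [0.4]]  # four hypothetical layers
    print(select_target_activation(acts, 0))  # final layer
    print(select_target_activation(acts, 2))  # two layers earlier
```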
Key Contributions and Results
The Mixture of Attentions architecture significantly advances the field of speculative decoding by addressing the challenges of partial observability and off-policy training while also offering flexibility in balancing speed and accuracy. In this section, we’ll highlight the paper's major contributions and review the experimental results that demonstrate the architecture’s effectiveness.
Decoding Speedup: One of the main goals of speculative decoding is to accelerate the inference process without sacrificing accuracy. The Mixture of Attentions architecture achieves this by introducing a more informed and efficient drafting process. By leveraging Layer Self-Attention (LSA) and Cross-Attention (CA), the smaller model can draft tokens more accurately while reducing the number of verification cycles required by the larger model.
Key Result:
- Mixture of Attentions achieves a 9.5% increase in decoding speed over the previous state-of-the-art method, EAGLE-2.
- This improvement is especially pronounced in single-device settings, where the Mixture of Attentions significantly reduces the time required to generate responses while maintaining high accuracy.
Higher Acceptance Rate: Another significant contribution is the increase in the acceptance rate of tokens generated by the smaller model. Thanks to Layer Self-Attention, the smaller model has a more complete view of the larger model’s internal state, which allows it to draft tokens that are more likely to be accepted during the verification process. Additionally, Cross-Attention improves the on-policyness of the smaller model’s training, increasing the likelihood that its drafts will be accepted.
Key Result:
- The Mixture of Attentions architecture results in a 25% higher acceptance rate than EAGLE-2.
- This means that the smaller model generates sequences more likely to be approved by the larger model, reducing the number of discarded tokens and improving overall efficiency.
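To see why acceptance matters for speed, consider a simplified model in which each drafted token is accepted independently with probability alpha and the draft is a single chain of num_draft tokens. Under that assumption the expected number of tokens produced per verification round is 1 + alpha + alpha^2 + ... + alpha^num_draft. The sketch below computes it; the independence assumption and the example values are ours for illustration and are not results from the paper (EAGLE-2 and Mixture of Attentions draft trees rather than single chains).

```python
def expected_tokens_per_round(alpha: float, num_draft: int) -> float:
    """Expected tokens emitted per large-model verification round.

    alpha: assumed per-token acceptance probability (independent per token).
    num_draft: number of tokens drafted per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if num_draft < 0:
        raise ValueError("num_draft must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus the large model's own token.
        return float(num_draft + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**num_draft.
    return (1.0 - alpha ** (num_draft + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance probabilities, chosen only to show the trend.
    for alpha in (0.6, 0.75):
        print(alpha, expected_tokens_per_round(alpha, num_draft=4))
```

The function is increasing in alpha, so a draft model whose tokens are accepted more often needs fewer large-model passes for the same output length.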
Client-Server Deployment: One of the most exciting aspects of the Mixture of Attentions architecture is its effectiveness in a client-server deployment scenario. In this setup, the smaller model runs on a client device (such as a mobile phone), while the larger model is hosted on a server. The smaller model generates tokens and sends them to the server for verification, but in the event of a network disconnection, it can continue generating tokens autonomously.
This capability is critical for edge computing and situations where continuous access to a powerful server is not guaranteed. By allowing the smaller model to operate independently when the server is unreachable, Mixture of Attentions enables more robust and flexible deployment of LLMs in real-world applications.
Key Result: The Mixture of Attentions achieves state-of-the-art latencies in client-server deployments, even under challenging network conditions (e.g., 4G and 5G networks). In cases of complete disconnection, the Mixture of Attentions model continues generating tokens with higher accuracy than other speculative decoding methods, which would otherwise fail without server access.
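The disconnection behaviour can be pictured as a simple fallback in the client loop. The sketch below is hypothetical: `verify_remote` stands in for whatever call sends drafts to the server, and we assume it raises `ConnectionError` when the server is unreachable and otherwise returns at least one token. It is not the API of the released vLLM implementation.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]
VerifyFn = Callable[[List[Token], List[Token]], List[Token]]


def generate_with_fallback(prompt: List[Token], draft_next: NextTokenFn,
                           verify_remote: VerifyFn, max_new_tokens: int,
                           num_draft: int = 4) -> Tuple[List[Token], int]:
    """Client loop: draft locally, verify on the server when it is reachable.

    Returns the generated tokens and the number of rounds the server verified.
    """
    out = list(prompt)
    verified_rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        # Draft on the client device.
        drafted: List[Token] = []
        for _ in range(num_draft):
            drafted.append(draft_next(out + drafted))
        try:
            # Hypothetical server call; must return at least one token.
            accepted = verify_remote(out, drafted)
            verified_rounds += 1
        except ConnectionError:
            # Server unreachable: keep the unverified drafts and carry on.
            accepted = drafted
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], verified_rounds


if __name__ == "__main__":
    def toy_draft(ctx: List[Token]) -> Token:
        return (ctx[-1] + 1) % 10

    def offline(ctx: List[Token], drafted: List[Token]) -> List[Token]:
        raise ConnectionError("server unreachable")

    print(generate_with_fallback([0], toy_draft, offline, max_new_tokens=5, num_draft=2))
```

In the offline case the output quality depends entirely on the small model, which is why the accuracy of the drafter on its own matters in this deployment.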
Adaptability with Target Layer Inference (TLI): Introducing Target Layer Inference (TLI) adds another layer of flexibility to the Mixture of Attentions architecture. By allowing the smaller model to target different layers of the large model, the architecture can adapt its behavior depending on the task. This flexibility enables a trade-off between speed and accuracy:
- If the goal is faster token generation, the smaller model can target earlier layers of the large model, reducing the computational cost.
- If accuracy is more important, the smaller model can target later layers, increasing the quality of its predictions.
Key Result: The architecture demonstrated that adjusting the target layer (N in the TLI mechanism) allows for fine-tuning the balance between speed and accuracy based on specific application needs.
Implications and Future Directions
The Mixture of Attentions architecture offers groundbreaking improvements in speculative decoding, with significant implications for both research and real-world applications of large language models (LLMs). By addressing core challenges like partial observability and off-policy training, this architecture paves the way for more efficient and scalable LLM deployment, particularly in edge computing scenarios where computational resources are limited.
Implications for Large Language Model Deployment: The rapid adoption of LLMs in various industries, from healthcare and education to finance and customer service, means that improving the efficiency of these models is more critical than ever. The Mixture of Attentions architecture provides several key advantages for deploying LLMs in practical settings:
- Faster Inference: The 9.5% increase in decoding speed makes real-time applications of LLMs, such as chatbots and virtual assistants, more responsive. This could translate into smoother end-user interactions, even when using models with billions of parameters.
- Edge and Client-Server Computing: In scenarios where LLMs need to be deployed on edge devices (like smartphones or IoT devices), the smaller model's ability to continue generating tokens independently in case of server disconnections is a game-changer. This opens up new possibilities for using LLMs in offline or low-connectivity environments, such as remote locations or autonomous systems.
- Energy Efficiency: With increasing concerns about the energy consumption of large models, improving the acceptance rate of speculative decoding while reducing the reliance on large models can lead to lower computational costs and energy consumption, making LLM deployments more sustainable.
Implications for Model Training: The architectural changes introduced by Mixture of Attentions also offer benefits during the training phase of LLMs:
- Improved Training Efficiency: The Cross-Attention (CA) layer enables more on-policy training, where the smaller model is trained in conditions that closely mimic real-world inference scenarios. This reduces the performance drop often seen when models transition from training to deployment.
- More Adaptive Models: By allowing the smaller model to target different layers of the large model (via Target Layer Inference), the architecture offers flexibility in how models are trained and optimized. Developers can adjust the depth of inference based on the task at hand, balancing computational cost and accuracy dynamically.
Future Directions: The Mixture of Attentions architecture opens up exciting avenues for future research and development. Some potential directions for future work include:
- Dynamic Target Layer Inference: One potential extension of this work is to enable the model to dynamically select the optimal target layer (N in Target Layer Inference) based on the complexity of the task or the current network conditions in client-server scenarios. This would allow for even more efficient and adaptable deployments, where the model could automatically balance speed and accuracy as needed.
- Privacy-Preserving Speculative Decoding: In client-server setups, there’s potential to explore privacy-preserving approaches where sensitive parts of a user’s input remain on the client side, and only non-sensitive data is sent to the server. The Mixture of Attentions architecture could be adapted to ensure that certain activations or token sequences are processed locally, allowing for privacy-sensitive deployments of LLMs in areas like healthcare or legal services.
- Extension to Other Domains: While this work focuses on token generation in LLMs, the principles of speculative decoding and the Mixture of Attentions could be extended to other domains where predictive models are used. For example, it could be applied to machine translation, code generation, or even robotic control systems, where fast, accurate predictions are critical.
- Advanced Edge Computing Applications: With the increasing trend toward decentralized and edge computing, Mixture of Attentions could play a key role in applications like autonomous vehicles, smart homes, and real-time translation devices. These systems require the ability to operate with minimal latency, and the Mixture of Attentions’ ability to handle disconnected operations is a valuable feature.
This architecture's reduced energy consumption and enhanced capabilities for offline and low-connectivity settings also make it well-suited for sustainable AI initiatives. As concerns about AI's environmental impact grow, the ability to deploy models more energy-efficiently without sacrificing performance will be increasingly important.
Please like, share, and follow if you found this useful!