Introduction

We're excited to introduce our latest model, the Lite Oute 2 Mamba2Attn 250M.
This is our third-generation model, featuring the new Mamba2 architecture with attention layers.
For more technical details covering the training process, architecture, and performance, see the full blog post.

Model Variants

Training Details

The model was pre-trained on 30 billion tokens using a balanced mixture of datasets:

  • 50% dclm-baseline-1.0
  • 50% fineweb-edu

Base model training was conducted on single NVIDIA 4090 and NVIDIA H100 GPUs, with the following key parameters:

  • Max learning rate: 4e-4
  • Min learning rate: 1e-4
  • Block size: 4096
  • Token batches: ~100k tokens
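
As a rough illustration of what these figures imply, the sketch below works through the step count and shows one way a learning rate could move between the two stated bounds. The cosine shape, the absence of warmup, and the treatment of "~100k tokens" as exactly 100,000 tokens per optimizer step are assumptions made for illustration; the schedule actually used is not stated here.

```python
import math

TOTAL_TOKENS = 30_000_000_000   # 30 billion pre-training tokens (from the text)
TOKENS_PER_BATCH = 100_000      # "~100k tokens" per batch, taken as exact (assumption)
BLOCK_SIZE = 4096               # sequence length in tokens
MAX_LR = 4e-4
MIN_LR = 1e-4

# Approximate number of optimizer steps and sequences per batch.
approx_steps = TOTAL_TOKENS // TOKENS_PER_BATCH
seqs_per_batch = TOKENS_PER_BATCH / BLOCK_SIZE


def cosine_lr(step: int, total_steps: int = approx_steps) -> float:
    """Cosine decay from MAX_LR to MIN_LR (assumed shape, no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(approx_steps, round(seqs_per_batch, 1))
print(cosine_lr(0), cosine_lr(approx_steps // 2), cosine_lr(approx_steps))
```

Under these assumptions the run is on the order of 300,000 steps, with roughly 24 sequences of 4096 tokens in each batch.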

For instruction training, we first trained the model with supervised fine-tuning (SFT), then refined it further with Direct Preference Optimization (DPO).

Benchmark Results

Benchmark           | Lite-Oute-2-Mamba2Attn-250M-Instruct | Lite-Oute-1-300M-Instruct | Lite-Mistral-150M-v2-Instruct
ARC-C (0-shot)      | 26.71                                | 26.02                     | -
ARC-E (0-shot)      | 53.70                                | 49.79                     | 39.52
HellaSwag (0-shot)  | 38.19                                | 34.50                     | 31.01
MMLU (0-shot)       | 25.13                                | 24.00                     | 25.28
OpenBookQA (0-shot) | 32.20                                | 32.20                     | 28.40
PIQA (0-shot)       | 66.59                                | 65.40                     | 60.23
Winogrande (0-shot) | 53.28                                | 53.75                     | 51.78
ARC-C (5-shot)      | 27.82                                | 26.37                     | -
ARC-E (5-shot)      | 57.28                                | 51.43                     | 46.30
HellaSwag (5-shot)  | 37.71                                | 34.93                     | 30.73
MMLU (5-shot)       | 25.81                                | 25.87                     | -
OpenBookQA (5-shot) | 31.80                                | 31.40                     | 28.00
PIQA (5-shot)       | 66.38                                | 65.07                     | 62.68
Winogrande (5-shot) | 51.14                                | 52.01                     | 51.30
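
For readers who want to tally the comparison themselves, the following sketch counts wins, ties, and losses of Lite-Oute-2-Mamba2Attn-250M-Instruct against Lite-Oute-1-300M-Instruct, using only the scores listed above. The dictionary layout and variable names are illustrative choices, not part of any evaluation tooling.

```python
# (benchmark, shots) -> (Lite-Oute-2-Mamba2Attn-250M-Instruct, Lite-Oute-1-300M-Instruct)
SCORES = {
    ("ARC-C", 0): (26.71, 26.02),
    ("ARC-E", 0): (53.70, 49.79),
    ("HellaSwag", 0): (38.19, 34.50),
    ("MMLU", 0): (25.13, 24.00),
    ("OpenBookQA", 0): (32.20, 32.20),
    ("PIQA", 0): (66.59, 65.40),
    ("Winogrande", 0): (53.28, 53.75),
    ("ARC-C", 5): (27.82, 26.37),
    ("ARC-E", 5): (57.28, 51.43),
    ("HellaSwag", 5): (37.71, 34.93),
    ("MMLU", 5): (25.81, 25.87),
    ("OpenBookQA", 5): (31.80, 31.40),
    ("PIQA", 5): (66.38, 65.07),
    ("Winogrande", 5): (51.14, 52.01),
}

# Split the settings by whether the newer model scores higher, equal, or lower.
wins = [key for key, (new, old) in SCORES.items() if new > old]
ties = [key for key, (new, old) in SCORES.items() if new == old]
losses = [key for key, (new, old) in SCORES.items() if new < old]

# Average score difference in percentage points across all settings.
mean_delta = sum(new - old for new, old in SCORES.values()) / len(SCORES)

print(f"wins={len(wins)} ties={len(ties)} losses={len(losses)}")
print(f"mean difference: {mean_delta:+.2f} points")
print("lower on:", losses)
```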

Interfacing with the Instruct Model

To ensure optimal performance, please use the following template when interacting with the model:

<s>system\n
{System message}</s>
<s>user\n
{Your question or message}</s>
<s>assistant\n
{The model's response}</s>

When prompting the model for a response, your input should be structured as follows:

<s>system\n
{System message}</s>
<s>user\n
{Your question or message}</s>
<s>assistant\n

This format ensures the best performance, as the model was fine-tuned using this specific template.
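
If you need to build the prompt string by hand, a minimal sketch might look like the following. The helper name `build_prompt` is hypothetical, and the sketch assumes that turns are separated by a single newline after each `</s>`, as the layout above suggests; the chat template shipped with the tokenizer remains the authoritative version.

```python
from typing import Dict, List


def build_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages in the <s>role / content / </s> layout shown above."""
    # Assumption: one newline follows each closing </s>.
    parts = [f"<s>{m['role']}\n{m['content']}</s>\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<s>assistant\n")
    return "".join(parts)


prompt = build_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who wrote the play Romeo and Juliet?"},
])
print(prompt)
```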

Recommended Settings

  • Temperature: 0.1 - 0.4
  • Repetition Penalty: 1.1 - 1.14

If you're using a Hugging Face tokenizer, make sure to use tokenizer.apply_chat_template or adjust your interface to match the template above.
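
To make the two settings concrete, here is a minimal, library-free sketch of what temperature and repetition penalty typically do to next-token logits. It follows the common formulation: logits of already-seen tokens are divided by the penalty when positive and multiplied by it when negative, and all logits are divided by the temperature before the softmax. The function names and toy logits are hypothetical; real inference libraries apply these steps to tensors inside their own generation loop.

```python
import math
from typing import List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float], seen_token_ids: Sequence[int], penalty: float = 1.12
) -> List[float]:
    """Make tokens that already appeared less likely to be picked again."""
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def softmax_with_temperature(logits: Sequence[float], temperature: float = 0.1) -> List[float]:
    """Lower temperatures sharpen the distribution; higher ones flatten it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]                      # toy vocabulary of three tokens
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 2], penalty=1.1)
sharp = softmax_with_temperature(penalized, temperature=0.1)
soft = softmax_with_temperature(penalized, temperature=0.4)
print(penalized)
print(sharp)
print(soft)
```

At the low end of the recommended temperature range the top token takes nearly all of the probability mass, while 0.4 leaves more room for alternatives.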

Let's look at some real-world examples:

<s>user
What are three key benefits of electric vehicles compared to traditional gasoline-powered cars?</s>
<s>assistant
Electric vehicles have several advantages over gasoline-powered cars. Here are three key benefits:
1. Efficient: Electric vehicles use less fuel and emit fewer emissions than gasoline-powered cars, which can save consumers money on their monthly bills.
2. Fuel economy: Electric vehicles produce more power per mile driven, which means they consume less fuel overall. This helps reduce the need for refueling stations and reduces carbon emissions from transportation.
3. Flexibility: Electric vehicles can be adapted to different driving conditions, such as snowy or rainy days, and can accommodate different types of terrain. This makes them easier to operate in urban areas with limited parking spaces.
Overall, electric vehicles offer a range of benefits that make them an attractive choice for many drivers. They provide a cost-effective alternative to gasoline-powered cars, reducing the need for frequent refueling stations, and making it easier to drive long distances without having to worry about running out of gas.</s>
<s>user
Who wrote the play "Romeo and Juliet"?</s>
<s>assistant
The play "Romeo and Juliet" by William Shakespeare is a classic example of a tragic love story. It tells the story of Romeo and Juliet, a young couple who fall in love with each other but ultimately lose their lives due to their respective flaws. The play was written during the Renaissance period and it has been performed around the world since then.</s>

Usage with Hugging Face Transformers

Model weights were converted from the original Mamba2 implementation to be Hugging Face compatible.
Due to the lack of official support for Mamba2 attention layers in Hugging Face Transformers, custom modeling files are included.
The implementation of Mamba2 with attention in the modeling files comes from Pull Request #32027 in the Hugging Face Transformers repository: https://github.com/huggingface/transformers/pull/32027

To speed up inference, we recommend installing mamba-ssm and flash attention 2.

mamba-ssm:

pip install "causal-conv1d>=1.4.0"
pip install mamba-ssm

flash attention 2:

pip install flash-attn --no-build-isolation

Example usage:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    "OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct",
    # To allow custom modeling files
    trust_remote_code=True,

    # If you have installed flash attention 2
    # attn_implementation="flash_attention_2",
    # torch_dtype=torch.bfloat16,
)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained("OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct")

def generate_response(message: str, temperature: float = 0.1, repetition_penalty: float = 1.12) -> str:
    # Apply the chat template and convert to PyTorch tensors
    messages = [
        {"role": "system", "content": "You are an AI assistant. You will be given a task. You must generate a detailed answer."},
        {"role": "user", "content": message}
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(device)
    # Generate the response
    output = model.generate(
        input_ids,
        # max_length counts the prompt tokens plus the generated tokens
        max_length=256,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        do_sample=True
    )
    # Decode the full sequence (prompt and response), dropping special tokens
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

message = "Who wrote the play Romeo and Juliet?"
response = generate_response(message)
print(response)

Fine-Tuning for Specific Tasks

For optimal task-specific performance, it is recommended to fine-tune the model on datasets relevant to the intended use case.
Fine-tuning can significantly improve the model's accuracy, relevance, and depth of knowledge in specialized domains or for particular types of tasks.
This process allows the model to adapt its pre-trained knowledge to the nuances and requirements of specific applications, potentially mitigating some of the weaknesses observed in general use.
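
As one possible starting point for preparing fine-tuning data, the sketch below converts (system, question, answer) triples into prompt/completion records that follow the chat template and serializes them as JSON Lines. The function names, the `prompt`/`completion` field names, and the single newline between turns are assumptions for illustration; adapt them to whatever your training framework expects.

```python
import json
from typing import Dict, Iterable, Tuple


def to_sft_example(system: str, question: str, answer: str) -> Dict[str, str]:
    """Split one conversation into the prompt and the text the model should learn to produce."""
    prompt = f"<s>system\n{system}</s>\n<s>user\n{question}</s>\n<s>assistant\n"
    completion = f"{answer}</s>"  # the closing </s> teaches the model where to stop
    return {"prompt": prompt, "completion": completion}


def to_jsonl(records: Iterable[Tuple[str, str, str]]) -> str:
    """Serialize (system, question, answer) triples as one JSON object per line."""
    return "\n".join(
        json.dumps(to_sft_example(*record), ensure_ascii=False) for record in records
    )


# Hypothetical domain-specific samples.
samples = [
    ("You are a support assistant.", "How do I reset my password?", "Open Settings and choose Reset Password."),
    ("You are a support assistant.", "Where can I find my invoices?", "Invoices are listed under Billing."),
]
print(to_jsonl(samples))
```

Keeping the prompt and completion separate makes it straightforward to compute the training loss on the assistant's response only, if your framework supports that.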

Conclusion

The Lite-Oute-2-Mamba2Attn-250M-Instruct model improves on previous versions in both zero-shot and few-shot evaluations.
Despite being smaller than Lite-Oute-1-300M-Instruct, it matches or exceeds that model on 11 of the 14 benchmark settings reported above, trailing only on Winogrande (0-shot and 5-shot) and 5-shot MMLU. The largest gains are on ARC-E and HellaSwag, which suggests better common-sense reasoning and language understanding.
While it can provide coherent responses to various queries, it has limitations due to its small size (250 million parameters).
Users should be aware that it may produce incorrect outputs and should not be relied upon for critical decisions without verification.
It's best suited for lightweight applications or as a basis for further development.

Disclaimer

By using this model, you acknowledge that you understand and assume the risks associated with its use. You are solely responsible for ensuring compliance with all applicable laws and regulations. We disclaim any liability for problems arising from the use of this open-source model, including but not limited to direct, indirect, incidental, consequential, or punitive damages. We make no warranties, express or implied, regarding the model's performance, accuracy, or fitness for a particular purpose. Your use of this model is at your own risk, and you agree to hold harmless and indemnify us, our affiliates, and our contributors from any claims, damages, or expenses arising from your use of the model.
