
LlamaV-o1


Overview

LlamaV-o1 is an advanced multimodal large language model (MLLM) designed for complex visual reasoning tasks. Trained with a curriculum learning strategy and optimized with inference techniques such as Beam Search, LlamaV-o1 demonstrates strong performance across diverse benchmarks. It is fine-tuned for step-by-step reasoning, enabling it to tackle tasks in domains such as visual perception, mathematical reasoning, social and cultural contexts, medical imaging, and document understanding.

The model is designed with a focus on interpretability and precision. By leveraging a structured reasoning approach, LlamaV-o1 provides coherent and accurate explanations for its decisions, making it an excellent tool for research and applications requiring high levels of reasoning. With over 4,000 manually verified reasoning steps in its benchmark evaluations, LlamaV-o1 sets a new standard for multimodal reasoning, delivering consistent and reliable results across challenging scenarios.

Key Features:

  • Model Size: 11 billion parameters.
  • Architecture: Based on the Llama family (Llama 3.2 Vision / mllama).
  • Fine-Tuning: Enhanced for instruction-following, chain-of-thought reasoning, and robust generalization across tasks.
  • Applications: Ideal for use cases such as conversational agents, educational tools, content creation, and more.

Model Details

  • Developed By: MBZUAI
  • Model Version: v0.1
  • Release Date: 13th January 2025
  • Training Dataset: Diverse multilingual corpus, including high-quality sources for instruction tuning, chain-of-thought datasets, and general-purpose corpora.
  • Framework: PyTorch

Intended Use

LlamaV-o1 is designed for a wide range of NLP tasks, including but not limited to:

  • Text Generation
  • Sentiment Analysis
  • Text Summarization
  • Question Answering
  • Chain-of-Thought Reasoning

Out-of-Scope Use

The model should not be used in applications requiring high-stakes decision-making, such as healthcare diagnosis, financial predictions, or any scenarios involving potential harm.

Training Procedure

  • Fine-Tuning: The model was fine-tuned on a dataset optimized for reasoning, coherence, and diversity, leveraging instruction-tuning techniques to enhance usability in downstream applications.
  • Optimizations: Includes inference scaling optimizations (e.g., Beam Search) to balance performance and computational efficiency, as sketched below.
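
As a concrete illustration, the following is a minimal sketch of inference scaling via beam-search decoding with the standard Transformers generate API. It assumes a model, processor, and prepared inputs as in the Usage section later in this card; num_beams=4 is an illustrative value, not the setting reported in the paper.

# Beam-search decoding as one simple form of inference-time scaling.
# Assumes `model`, `processor`, and `inputs` prepared as in the Usage section below;
# num_beams=4 is an illustrative value, not the paper's reported setting.
output = model.generate(
    **inputs,
    max_new_tokens=512,
    num_beams=4,          # keep several candidate reasoning paths in parallel
    early_stopping=True,  # stop once all beams have finished
)
print(processor.decode(output[0], skip_special_tokens=True))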

Evaluation

Benchmarks

LlamaV-o1 has been evaluated on a suite of benchmark tasks, including the proposed VRC-Bench; see the Results section below.

Limitations

While the model performs well on a broad range of tasks, it may struggle with:

  • Highly technical, domain-specific knowledge outside the training corpus.
  • Generating accurate outputs for ambiguous or adversarial prompts.

Usage

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"

# Load the model in bfloat16 and shard it across available devices
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

Please refer to llamav-o1.py for inference.
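
Building on the model and processor loaded above, the snippet below is a minimal inference sketch assuming the standard mllama chat-template workflow in Transformers; the image URL and prompt are placeholders, and llamav-o1.py remains the reference implementation.

import requests
from PIL import Image

# Placeholder image URL for illustration; replace with your own input.
url = "https://example.com/chart.png"
image = Image.open(requests.get(url, stream=True).raw)

# Ask for step-by-step reasoning, which the model is fine-tuned to produce.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Look at the image and answer step by step: what is the overall trend?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))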

Results

Table 1: Comparison of models based on Final Answer accuracy and Reasoning Steps performance on the proposed VRC-Bench. The best results in each case (closed-source and open-source) are in bold. Our LlamaV-o1 achieves superior performance compared to its open-source counterpart (Llava-CoT) while also being competitive against the closed-source models.

Model             | Final Answer | Reasoning Steps
GPT-4o            | 59.28        | 76.68
Claude-3.5        | 61.35        | 72.12
Gemini-2.0        | 61.16        | 74.08
Gemini-1.5 Pro    | 61.35        | 72.12
Gemini-1.5 Flash  | 54.99        | 71.86
GPT-4o Mini       | 56.39        | 74.05
Llama-3.2 Vision  | 48.40        | 58.37
Mulberry          | 51.90        | 63.86
Llava-CoT         | 54.09        | 66.21
LlamaV-o1 (Ours)  | 56.49        | 68.93

Training Data

LlamaV-o1 is trained on the LLaVA-CoT-100k dataset. We have formatted the training samples for multi-step reasoning.
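
As a rough sketch of how the data can be inspected, the snippet below loads LLaVA-CoT-100k with the Hugging Face datasets library; the repository id Xkev/LLaVA-CoT-100k and the printed fields are assumptions, and the exact multi-step formatting used for training may differ.

from datasets import load_dataset

# Assumed repository id for LLaVA-CoT-100k; verify against the dataset card.
ds = load_dataset("Xkev/LLaVA-CoT-100k", split="train")

# Inspect one example to see how images and multi-step reasoning traces are stored.
sample = ds[0]
print(sample.keys())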

Training Procedure

The LlamaV-o1 model is fine-tuned using llama-recipes. A detailed training procedure will be released soon!

Citation

If you find this paper useful, please consider starring 🌟 our GitHub repo and citing 📑 our paper:

@misc{thawakar2025llamavo1,
      title={LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs}, 
      author={Omkar Thawakar and Dinura Dissanayake and Ketan More and Ritesh Thawkar and Ahmed Heakl and Noor Ahsan and Yuhao Li and Mohammed Zumri and Jean Lahoud and Rao Muhammad Anwer and Hisham Cholakkal and Ivan Laptev and Mubarak Shah and Fahad Shahbaz Khan and Salman Khan},
      year={2025},
      eprint={2501.06186},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.06186}, 
}