VoCo-LLaMA: Towards Vision Compression with Large Language Models • arXiv:2406.12275 • Published Jun 18, 2024
TroL: Traversal of Layers for Large Language and Vision Models • arXiv:2406.12246 • Published Jun 18, 2024
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning • arXiv:2406.15334 • Published Jun 21, 2024
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning • arXiv:2406.12742 • Published Jun 18, 2024
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs • arXiv:2406.18521 • Published Jun 26, 2024
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models • arXiv:2406.17294 • Published Jun 25, 2024
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning • arXiv:2406.17770 • Published Jun 25, 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs • arXiv:2406.16860 • Published Jun 24, 2024
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding • arXiv:2406.19389 • Published Jun 27, 2024
mDPO: Conditional Preference Optimization for Multimodal Large Language Models • arXiv:2406.11839 • Published Jun 17, 2024
Unifying Multimodal Retrieval via Document Screenshot Embedding • arXiv:2406.11251 • Published Jun 17, 2024
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding • arXiv:2406.19263 • Published Jun 27, 2024
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output • arXiv:2407.03320 • Published Jul 3, 2024
TokenPacker: Efficient Visual Projector for Multimodal LLM • arXiv:2407.02392 • Published Jul 2, 2024
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams • arXiv:2406.08085 • Published Jun 12, 2024
Understanding Alignment in Multimodal LLMs: A Comprehensive Study • arXiv:2407.02477 • Published Jul 2, 2024
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation • arXiv:2407.06135 • Published Jul 8, 2024
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models • arXiv:2407.07895 • Published Jul 10, 2024
SEED-Story: Multimodal Long Story Generation with Large Language Model • arXiv:2407.08683 • Published Jul 11, 2024
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model • arXiv:2407.16198 • Published Jul 23, 2024
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models • arXiv:2408.08872 • Published Aug 16, 2024
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation • arXiv:2408.15881 • Published Aug 28, 2024
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture • arXiv:2409.02889 • Published Sep 4, 2024
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion • arXiv:2412.04424 • Published Dec 5, 2024