PUMA: Empowering Unified MLLM with Multi-granular Visual Generation Paper • 2410.13861 • Published Oct 17, 2024 • 53
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Paper • 2411.07975 • Published Nov 12, 2024 • 27
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published Nov 15, 2024 • 71
Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published Nov 21, 2024 • 43
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding Paper • 2411.14347 • Published Nov 21, 2024 • 13
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models Paper • 2411.14982 • Published Nov 22, 2024 • 16
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction Paper • 2411.14762 • Published Nov 22, 2024 • 11
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives Paper • 2411.02545 • Published Nov 4, 2024 • 1
Hymba: A Hybrid-head Architecture for Small Language Models Paper • 2411.13676 • Published Nov 20, 2024 • 40
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory Paper • 2411.11922 • Published Nov 18, 2024 • 18
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 78
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration Paper • 2411.17686 • Published Nov 26, 2024 • 19
DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting Paper • 2411.17223 • Published Nov 26, 2024 • 5
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity Paper • 2411.15411 • Published Nov 23, 2024 • 7
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI Paper • 2411.14522 • Published Nov 21, 2024 • 31
Knowledge Transfer Across Modalities with Natural Language Supervision Paper • 2411.15611 • Published Nov 23, 2024 • 15
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding Paper • 2411.18363 • Published Nov 27, 2024 • 9
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality Paper • 2411.15241 • Published Nov 22, 2024 • 5
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient Paper • 2411.17787 • Published Nov 26, 2024 • 11
On Domain-Specific Post-Training for Multimodal Large Language Models Paper • 2411.19930 • Published Nov 29, 2024 • 25
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos Paper • 2409.19603 • Published Sep 29, 2024 • 19
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27, 2024 • 53
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning Paper • 2412.03248 • Published Dec 4, 2024 • 26
CompCap: Improving Multimodal Large Language Models with Composite Captions Paper • 2412.05243 • Published Dec 6, 2024 • 18
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper • 2412.04424 • Published Dec 5, 2024 • 59
POINTS1.5: Building a Vision-Language Model towards Real World Applications Paper • 2412.08443 • Published Dec 11, 2024 • 38
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions Paper • 2412.08737 • Published Dec 11, 2024 • 52
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Paper • 2412.09604 • Published Dec 12, 2024 • 35
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer Paper • 2412.13871 • Published Dec 18, 2024 • 18
AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities Paper • 2412.14123 • Published Dec 18, 2024 • 11
FastVLM: Efficient Vision Encoding for Vision Language Models Paper • 2412.13303 • Published Dec 17, 2024 • 13
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models Paper • 2412.05939 • Published Dec 8, 2024 • 15
Grounding Descriptions in Images informs Zero-Shot Visual Recognition Paper • 2412.04429 • Published Dec 5, 2024
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models Paper • 2501.05767 • Published Jan 10, 2025 • 26