PUMA: Empowering Unified MLLM with Multi-granular Visual Generation Paper • 2410.13861 • Published Oct 17, 2024 • 53
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Paper • 2411.07975 • Published Nov 12, 2024 • 27
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published Nov 15, 2024 • 71
Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published Nov 21, 2024 • 43
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding Paper • 2411.14347 • Published Nov 21, 2024 • 13
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models Paper • 2411.14982 • Published Nov 22, 2024 • 16
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction Paper • 2411.14762 • Published Nov 22, 2024 • 11
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives Paper • 2411.02545 • Published Nov 4, 2024 • 1
Hymba: A Hybrid-head Architecture for Small Language Models Paper • 2411.13676 • Published Nov 20, 2024 • 40
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory Paper • 2411.11922 • Published Nov 18, 2024 • 18
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 78
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration Paper • 2411.17686 • Published Nov 26, 2024 • 19
DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting Paper • 2411.17223 • Published Nov 26, 2024 • 5
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity Paper • 2411.15411 • Published Nov 23, 2024 • 7
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI Paper • 2411.14522 • Published Nov 21, 2024 • 31
Knowledge Transfer Across Modalities with Natural Language Supervision Paper • 2411.15611 • Published Nov 23, 2024 • 15
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding Paper • 2411.18363 • Published Nov 27, 2024 • 9
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality Paper • 2411.15241 • Published Nov 22, 2024 • 5
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient Paper • 2411.17787 • Published Nov 26, 2024 • 11
On Domain-Specific Post-Training for Multimodal Large Language Models Paper • 2411.19930 • Published Nov 29, 2024 • 25
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos Paper • 2409.19603 • Published Sep 29, 2024 • 19
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27, 2024 • 53
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning Paper • 2412.03248 • Published Dec 4, 2024 • 26
CompCap: Improving Multimodal Large Language Models with Composite Captions Paper • 2412.05243 • Published Dec 6, 2024 • 18
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper • 2412.04424 • Published Dec 5, 2024 • 59
POINTS1.5: Building a Vision-Language Model towards Real World Applications Paper • 2412.08443 • Published Dec 11, 2024 • 38
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions Paper • 2412.08737 • Published Dec 11, 2024 • 52
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Paper • 2412.09604 • Published Dec 12, 2024 • 35
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer Paper • 2412.13871 • Published Dec 18, 2024 • 18
AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities Paper • 2412.14123 • Published Dec 18, 2024 • 11
FastVLM: Efficient Vision Encoding for Vision Language Models Paper • 2412.13303 • Published Dec 17, 2024 • 13
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models Paper • 2412.05939 • Published Dec 8, 2024 • 15
Grounding Descriptions in Images informs Zero-Shot Visual Recognition Paper • 2412.04429 • Published Dec 5, 2024
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models Paper • 2501.05767 • Published Jan 10, 2025 • 26