VLM - a poonyZ Collection

VLM

updated 6 days ago

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Paper • 2410.13360 • Published Oct 17, 2024 • 8

Note: Worth watching

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Paper • 2411.18203 • Published Nov 27, 2024 • 32

Towards Interpreting Visual Information Processing in Vision-Language Models

Paper • 2410.07149 • Published Oct 9, 2024 • 1

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Paper • 2407.02477 • Published Jul 2, 2024 • 22

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Paper • 2411.15453 • Published Nov 23, 2024

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

Paper • 2411.14982 • Published Nov 22, 2024 • 16

I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token

Paper • 2412.06676 • Published Dec 9, 2024 • 9

Note: Decent

From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding

Paper • 2412.06474 • Published Dec 9, 2024

Note: Hard to say

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

Paper • 2412.09585 • Published Dec 12, 2024 • 10

Note: Worth watching

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Paper • 2412.09604 • Published Dec 12, 2024 • 35

Note: Decent

Analyzing The Language of Visual Tokens

Paper • 2411.05001 • Published Nov 7, 2024 • 23

Note: Worth watching

LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

Paper • 2412.13871 • Published 28 days ago • 18

FastVLM: Efficient Vision Encoding for Vision Language Models

Paper • 2412.13303 • Published 29 days ago • 13

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Paper • 2412.18619 • Published about 1 month ago • 53

Note: Keep following

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Paper • 2412.19326 • Published 20 days ago • 18

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Paper • 2412.18525 • Published 22 days ago • 68

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

Paper • 2501.01904 • Published 12 days ago • 31