- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 27
- MoDE: CLIP Data Experts via Clustering
  Paper • 2404.16030 • Published • 13
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
  Paper • 2405.12130 • Published • 47
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
  Paper • 2405.12981 • Published • 29
Collections
Collections including paper arxiv:2405.20204
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 41
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Paper • 2406.06525 • Published • 67
- Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
  Paper • 2406.06469 • Published • 25
- Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
  Paper • 2406.04271 • Published • 29
- Block Transformer: Global-to-Local Language Modeling for Fast Inference
  Paper • 2406.02657 • Published • 38
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever
  Paper • 2405.20204 • Published • 35
- jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
  Paper • 2412.08802 • Published • 4
- jinaai/jina-clip-v2
  Zero-Shot Image Classification • Updated • 27.6k • 160
- jinaai/jina-clip-v1
  Feature Extraction • Updated • 50.4k • 235
- Improving Text Embeddings with Large Language Models
  Paper • 2401.00368 • Published • 79
- Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
  Paper • 2405.06932 • Published • 17
- Gecko: Versatile Text Embeddings Distilled from Large Language Models
  Paper • 2403.20327 • Published • 48
- Multilingual E5 Text Embeddings: A Technical Report
  Paper • 2402.05672 • Published • 21
- Scalable Diffusion Models with Transformers
  Paper • 2212.09748 • Published • 17
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
  Paper • 2311.15127 • Published • 12
- Learning Transferable Visual Models From Natural Language Supervision
  Paper • 2103.00020 • Published • 11
- U-Net: Convolutional Networks for Biomedical Image Segmentation
  Paper • 1505.04597 • Published • 9
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  Paper • 2403.09611 • Published • 125
- Evolutionary Optimization of Model Merging Recipes
  Paper • 2403.13187 • Published • 51
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
  Paper • 2402.03766 • Published • 13
- LLM Agent Operating System
  Paper • 2403.16971 • Published • 65
- InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
  Paper • 2401.13313 • Published • 5
- BAAI/Bunny-v1_0-4B
  Text Generation • Updated • 267 • 9
- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 102
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever
  Paper • 2405.20204 • Published • 35
- YOLO-World: Real-Time Open-Vocabulary Object Detection
  Paper • 2401.17270 • Published • 35
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
  Paper • 2401.14405 • Published • 12
- Improving fine-grained understanding in image-text pre-training
  Paper • 2401.09865 • Published • 16
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 27