MiniMaxAI/MiniMax-VL-01
Text Generation • Updated • 40 • 121
Note A non-transformer-based VLM (ViT-MLP-LLM framework)
Note 456B LLM with a 1M-token training context
Note End-side (on-device) multimodal LLM that supports real-time conversation and video understanding.
Note A unified model for dense grounded understanding of images & videos.
Note A multimodal dataset for vision-language pretraining; it includes 6.5M images + 0.8B text from 22k hours of instructional videos