ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license
ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093
> The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) both for images and videos
> The models come in 1B, 4B and 8B and are based on InternVL2.5 for the base architecture and Qwen2, Qwen2.5 and InternLM2 for the language model part (depending on the checkpoint)
> The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then it concatenates their outputs to feed into the LLM
the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks
> Their annotation pipeline is also interesting: they seem to use two open large vision LMs to refine the annotations, and have different levels of descriptions to provide consistency.
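
To make the data flow above a bit more concrete, here is a toy sketch in plain Python. It is not Sa2VA's actual code or API: `Segment`, `build_llm_input`, `route_seg_tokens`, the `[SEG]` marker string and the lambda "mask decoder" are all hypothetical names made up for illustration, and strings stand in for embeddings and masks. It only shows the two ideas from the post: per-modality outputs get concatenated into one LLM input, and each segmentation token in the LLM output is handed to a SAM2-like decoder together with the text it belongs to.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

SEG_TOKEN = "[SEG]"  # hypothetical marker meaning "a mask belongs to the text just before this"


@dataclass
class Segment:
    modality: str      # e.g. "image", "video", "visual_prompt", "text"
    tokens: List[str]  # stand-in for that modality encoder's output embeddings


def build_llm_input(segments: List[Segment]) -> List[str]:
    """Concatenate the per-modality token sequences into one LLM input sequence."""
    sequence: List[str] = []
    for segment in segments:
        sequence.extend(segment.tokens)
    return sequence


def route_seg_tokens(
    llm_output: List[str],
    mask_decoder: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each segmentation token with the text preceding it and decode a mask.

    Assumption: a real system would pass the segmentation token's hidden state
    to the decoder; this toy passes the preceding phrase instead.
    """
    results: List[Tuple[str, str]] = []
    text_so_far: List[str] = []
    for token in llm_output:
        if token == SEG_TOKEN:
            phrase = " ".join(text_so_far)
            results.append((phrase, mask_decoder(phrase)))
            text_so_far = []
        else:
            text_so_far.append(token)
    return results


# Toy inputs: one image, one visual prompt (a box) and a text prompt.
segments = [
    Segment("image", ["<img_0>", "<img_1>"]),
    Segment("visual_prompt", ["<box_0>"]),
    Segment("text", ["segment", "the", "dog"]),
]
llm_input = build_llm_input(segments)

# Pretend the LLM answered with two phrases, each followed by a segmentation token.
llm_output = ["the", "dog", SEG_TOKEN, "the", "ball", SEG_TOKEN]
masks = route_seg_tokens(llm_output, mask_decoder=lambda phrase: f"mask<{phrase}>")
```

The point of routing through a marker token is that the LLM never has to produce pixels itself: it only decides where in the text a mask belongs, and the mask decoder does the rest.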