Finetuning for pointing and object detection tasks
This is a fantastic model! I really appreciate the pointing feature—it's something I’ve only seen in the Molmo model before. However, unlike Molmo, which outputs point coordinates as HTML tags in its text output, this model appears to have dedicated heads for generating points and bounding boxes, which is very impressive.
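To illustrate the difference: when points are emitted as tags inside the generated text, the caller has to parse them back out of the string, whereas a dedicated head would return coordinates directly. Here is a minimal sketch of that parsing step. The tag layout (`<point x="..." y="..." alt="...">`), the assumption that coordinates are percentages of the image size, and the helper name `parse_points` are all illustrative assumptions, not a confirmed format for either model.

```python
import re

# Assumed tag layout, for illustration only:
#   <point x="50.0" y="25.0" alt="dog">dog</point>
POINT_TAG = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>')


def parse_points(text, width, height):
    """Extract (x, y) pixel coordinates from point tags in generated text.

    Assumes x and y are percentages (0-100) of the image width and height.
    """
    return [
        (float(x) / 100 * width, float(y) / 100 * height)
        for x, y in POINT_TAG.findall(text)
    ]


# Hypothetical model output containing a single point tag.
sample = 'The dog is here: <point x="50.0" y="25.0" alt="dog">dog</point>'
points = parse_points(sample, width=640, height=480)
```

Text without any tags simply yields an empty list, so the caller has to handle the "no point found" case itself.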
I’m particularly curious about how these new features can be fine-tuned. Do you plan to release a notebook demonstrating the process (for pointing and object detection)? A blog post explaining the model's architecture would also be incredibly helpful for understanding its unique capabilities.
Additionally, I noticed that the example code includes special methods like model.caption and model.query. Does this mean the model cannot be used like a traditional vision-language model? Is it possible to input a chat history for conversational use?
Thank you again for this amazing model!
I am working on a post detailing how the pointing/bounding box heads work, as well as scripts for finetuning. Will reply here when that's up.
The model currently cannot be used in a multi-turn conversational setting. We're focused on maximizing single-turn visual understanding, since that's what is most useful for developers building vision applications.
very cool, looking forward to that vik and planning to do a vid