Abstract
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
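As a rough illustration of the objective described in the abstract, the following toy sketch flattens per-frame token lists into one sequence and scores next-token prediction with a cross-entropy loss. It is not the authors' implementation: the helper names, the 4-entry vocabulary, the token values, and the uniform logits that stand in for a transformer's output are all assumptions made for illustration.

```python
import math

def flatten_video(frames):
    """Concatenate per-frame token lists into a single 1-D sequence (frame order kept)."""
    return [tok for frame in frames for tok in frame]

def next_token_pairs(tokens):
    """Shift by one: the token at position t is paired with the target at position t + 1."""
    return tokens[:-1], tokens[1:]

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def autoregressive_loss(logits_per_position, targets):
    """Mean next-token cross-entropy over all positions."""
    losses = [cross_entropy(l, t) for l, t in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)

# Hypothetical toy data: 2 frames, 3 tokens per frame, vocabulary of 4 tokens.
VOCAB_SIZE = 4
frames = [[0, 1, 2], [3, 1, 0]]

tokens = flatten_video(frames)
inputs, targets = next_token_pairs(tokens)

# Placeholder for a model: uniform logits at every position (an assumption).
uniform_logits = [[0.0] * VOCAB_SIZE for _ in inputs]
loss = autoregressive_loss(uniform_logits, targets)
```

With uniform logits the loss equals the log of the vocabulary size, which is the baseline a trained model would need to beat.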
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Multimodal Autoregressive Pre-training of Large Vision Encoders (2024)
- PruneVid: Visual Token Pruning for Efficient Video Large Language Models (2024)
- Moto: Latent Motion Token as the Bridging Language for Robot Manipulation (2024)
- Autoregressive Video Generation without Vector Quantization (2024)
- RobustFormer: Noise-Robust Pre-training for images and videos (2024)
- Improving Generative Pre-Training: An In-depth Study of Masked Image Modeling and Denoising Models (2024)
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Hi, I was reading the paper and noticed that the frames per second (fps) of the videos used during training are not explicitly mentioned. Since I’m not very familiar with this field, I’m wondering if this is a relevant detail for the learning process. Does the choice of fps impact what the model learns, and could you clarify the fps used in the datasets or after sampling?
Thanks in advance for your help!