arxiv:2501.05453

An Empirical Study of Autoregressive Pre-training from Videos

Published on Jan 9
· Submitted by brjathu on Jan 10
#3 Paper of the day

Abstract

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
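To make the "videos as sequences of visual tokens" framing concrete, here is a minimal sketch of how next-token training pairs could be built from a tokenized clip. It assumes each frame has already been turned into a grid of discrete token ids by some tokenizer, and that frames are flattened in raster order (frame by frame, row by row); the abstract does not specify either choice, and the names `flatten_video` and `next_token_pairs` are hypothetical, not taken from the Toto code.

```python
from typing import List, Tuple

Frame = List[List[int]]  # one frame as a 2D grid of discrete token ids


def flatten_video(frames: List[Frame]) -> List[int]:
    """Flatten frames into one token sequence: frame by frame, row by row.

    Raster order is an assumption made for this sketch.
    """
    return [tok for frame in frames for row in frame for tok in row]


def next_token_pairs(tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Shift by one: the model sees tokens[:-1] and predicts tokens[1:]."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    return tokens[:-1], tokens[1:]


# Toy video: 2 frames, each a 2x2 grid of made-up token ids.
video = [
    [[5, 1], [7, 3]],
    [[5, 2], [7, 4]],
]
sequence = flatten_video(video)
inputs, targets = next_token_pairs(sequence)
```

In an actual training loop, a causal transformer would read `inputs` and be scored with cross-entropy against `targets` at every position; the shift-by-one pairing is the only part shown here.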

Community

Check out a detailed explanation of the paper: https://gyanendradas.substack.com/p/toto-paper-explained

Hi, I was reading the paper and noticed that the frames per second (fps) of the videos used during training are not explicitly mentioned. Since I’m not very familiar with this field, I’m wondering if this is a relevant detail for the learning process. Does the choice of fps impact what the model learns, and could you clarify the fps used in the datasets or after sampling?

Thanks in advance for your help!
