arxiv:2306.01116

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Published on Jun 1, 2023
Submitted by akhaliq on Jun 5, 2023
#2 Paper of the day

Abstract

Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3B and 7.5B parameter language models trained on it.

Community

I've had consistently good results with any model that was trained on this dataset.

Models citing this paper 114

Datasets citing this paper 13

Spaces citing this paper 746

Collections including this paper 10