They Said It Couldn’t Be Done

Community Article Published December 5, 2024

Training large language models required copyrighted data until it did not. Today we release the Pleias 1.0 models, a family of fully open small language models. The family includes three base models with 350M, 1.2B, and 3B parameters. It also includes two specialized models for knowledge retrieval, Pleias-Pico (350M parameters) and Pleias-Nano (1.2B parameters), which show unprecedented performance for their size on multilingual Retrieval-Augmented Generation.

These are the first ever models trained exclusively on open data, meaning data that are either non-copyrighted or published under a permissive license. They are also the first fully EU AI Act compliant models. In fact, Pleias sets a new standard for safety and openness.

Our models are:

  • multilingual, offering strong support for multiple European languages
  • safe, showing the lowest rate of toxic generations on our toxicity benchmark
  • performant for key tasks, such as knowledge retrieval
  • able to run efficiently on consumer-grade hardware locally (CPU-only, without quantisation)

Pleias 1.0 models achieve strong multilingual performance through custom tokenizer development and curation of high-quality multilingual data. We show that our models are best-in-class in terms of language adherence. The pico model (350M) is the first in its weight class to have such broad linguistic coverage. Fully supported languages for these models include English, French, Spanish, German, Italian, Dutch, Latin and Portuguese.

The Pleias 1.0 family embodies a new approach to specialized small language models for end applications: wound-up models. During pretraining, we implemented a set of ideas and solutions that produce a frugal yet powerful language model specifically optimized for downstream RAG implementations. We release two wound-up models further trained for Retrieval-Augmented Generation (RAG): Pleias-pico-350m-RAG and Pleias-nano-1B-RAG. These models are designed to be deployed locally, so we prioritized frugality. Because our models are small, they run smoothly even on devices with limited RAM.

To do this, we have started building a new pretraining ecosystem based exclusively on open source tools. We have done this in collaboration with, and with the support of, open source AI industry leaders such as TractoAI and, of course, HuggingFace.

Training Data

We are moving away from the standard format of web archives. Instead, we use Common Corpus, our new dataset composed of uncopyrighted and permissively licensed data. To create this dataset, we had to develop an extensive range of tools to collect, generate, and process pretraining data.

Data Preprocessing

We created custom data processing tools. We trained a small but reliable OCR correction model that corrects digitization errors at scale, for example by fixing spacing issues, replacing incorrect words, and repairing broken text structures. It is small enough to run on CPU alone. That model and other OCR correction tools are available on HuggingFace.

We also developed a specialized pipeline for addressing toxic and harmful content. As many existing tools work poorly with our multilingual data, which contain historical texts and OCR errors, we trained a custom toxicity classifier, which we used to remove harmful language about minoritized groups without over-filtering our corpus. Our classifier is available on HuggingFace and further details about the procedure are in the full paper.

Synthetic Data Generation

To supplement our corpus, we generated 30B+ words synthetically with models that allow reuse of their outputs. Our design for synthetic data has been guided by the need to preserve linguistic and cultural diversity. In this spirit, for our 1B model, we augmented our training set by extracting ~100B words of high-quality, multilingual data from OpenAlex. Using a custom processing pipeline that integrates a YOLO fine-tune, we downloaded and processed over 10M PDFs.

Then, we built a synthetic data pipeline to generate knowledge-retrieval oriented instructions out of post-processed seed texts. We used the extracted OpenAlex dataset and several fine-tuned larger models to generate billions of tokens of RAG/Instruct format training data, relying on a Map-Reduce based TractoAI method.

Model Training

Pretraining code relied on Nanotron, the HuggingFace library. We provide the complete settings as YAML files as part of our release. Pleias 1.0 models are transformer base models, pretrained entirely from scratch, with an architecture similar to Llama and GPT-NeoX for easier deployment and inference.

The pico (350M) and base (3B) models were trained on the Jean Zay supercomputer, under compute grant #GC011015451 as part of the Grand Challenge. We developed our nano (1.2B) model in collaboration with TractoAI, a serverless AI platform for running data- and compute-intensive workloads at scale. TractoAI is built on top of YTsaurus, a powerful open-source technology. To train on it, we made several adaptations together with the TractoAI team:

  • we transformed our pre-tokenized pretraining data into TractoAI tables that efficiently store batches of tokenized sequences
  • we created a dataset adapter, with corresponding configuration options, so that Nanotron can read the tokenized data tables
  • we adapted all file system operations, including checkpoint saving, to use TractoAI commands
  • finally, we used the tractorun framework to deploy and coordinate distributed training in an automatic, fault-tolerant manner
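
As a rough illustration of the adapter idea in the list above, the sketch below flattens table rows that each hold a batch of tokenized sequences into individual training samples. It is not the actual Nanotron or TractoAI code: the class name `TokenTableAdapter`, the `input_ids` column name, and the row layout are all hypothetical, and the rows here are plain Python dictionaries rather than real TractoAI tables.

```python
from typing import Dict, Iterable, Iterator, List


class TokenTableAdapter:
    """Hypothetical adapter: yields one sample per tokenized sequence.

    Each input row is assumed to store a batch of equally long token
    sequences under a single column (an assumption, not the real schema).
    """

    def __init__(
        self,
        rows: Iterable[Dict[str, List[List[int]]]],
        column: str = "input_ids",
        seq_len: int = 2048,
    ) -> None:
        self.rows = rows
        self.column = column
        self.seq_len = seq_len

    def __iter__(self) -> Iterator[Dict[str, List[int]]]:
        for row in self.rows:
            for sequence in row[self.column]:
                # Pre-tokenized data should already be packed to a fixed length.
                if len(sequence) != self.seq_len:
                    raise ValueError(
                        f"expected {self.seq_len} tokens, got {len(sequence)}"
                    )
                yield {"input_ids": sequence}
```

Storing whole batches per row keeps the number of table reads low, while the adapter hides that layout from the training loop.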

Using a CO2 emissions calculator, we determined that for the two smaller models, our emissions were below those of comparably sized models such as OpenELM, whose 300M model generated the equivalent of 1.5 tonnes of CO2 (tCO2eq) and whose 1.1B model generated approximately 5.5 tCO2eq during training. Our models also generated far fewer emissions than their Llama 3.2 counterparts: roughly 27 times fewer for the nano model and 8 times fewer for the base model, per the table below.

| Model | # GPUs | GPU type | Training time (days) | Pleias carbon emissions (tCO2eq) | OpenELM (tCO2eq) | Llama 3.2 (tCO2eq) |
|---|---|---|---|---|---|---|
| Pleias 1.0 pico (350M) | 64 | H100 | 1.92 | 0.5 | 1.5 | - |
| Pleias 1.0 nano (1B) | 192 | H100 | 5 | 4 | 5.5 | 107 |
| Pleias 1.0 base (3B) | 192 | H100 | 20 | 16 | 7 | 133 |
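
The GPU-hours behind each run follow directly from the table (GPUs × days × 24). The sketch below computes them and adds a generic energy-based emissions estimate. The power draw, PUE, and grid carbon intensity are parameters you must supply yourself: they are assumptions for illustration, not the values used by the calculator mentioned above, so the output should not be expected to reproduce the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    name: str
    num_gpus: int
    days: float

    @property
    def gpu_hours(self) -> float:
        """Total accelerator time: GPUs x days x 24 hours."""
        return self.num_gpus * self.days * 24


def estimate_tco2eq(
    run: TrainingRun,
    gpu_power_kw: float,
    pue: float,
    grid_kgco2_per_kwh: float,
) -> float:
    """Generic estimate: energy (kWh) x carbon intensity, in tonnes CO2eq.

    All three parameters are caller-supplied assumptions.
    """
    energy_kwh = run.gpu_hours * gpu_power_kw * pue
    return energy_kwh * grid_kgco2_per_kwh / 1000.0


# GPU counts and training times taken from the table above.
RUNS = [
    TrainingRun("pico", num_gpus=64, days=1.92),
    TrainingRun("nano", num_gpus=192, days=5),
    TrainingRun("base", num_gpus=192, days=20),
]

if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.name}: {run.gpu_hours:.0f} GPU-hours")
```

Because the estimate scales linearly with carbon intensity, the same number of GPU-hours can yield very different emissions depending on where the training cluster draws its power.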

Custom Model Evaluations

Evaluation of small models is fraught with problems. The most popular generalist benchmarks are not suitable for evaluating small models. Instead, we developed targeted benchmarks to evaluate the capabilities that are essential to our intended downstream applications. Our primary concerns were ensuring that our model:

  • achieves impressive performance on RAG tasks
  • offers reliable multilingual performance
  • does not generate toxic or harmful text

RAG Performance

We evaluate the pico and nano models on a RAG task. As existing benchmarks are largely limited to English, we developed a custom multilingual RAG benchmark. We synthetically generated queries and small sets of documents, then prompted each model with a query and its documents. We then ran a head-to-head Elo-based tournament with GPT-4o as judge. We release the prompts and generations for all models we compared. Our nano (1.2B) model outperforms Llama 3.2 1B and EuroLLM 1.7B. Our pico (350M) model outperforms other models in its weight class, such as SmolLM 360M and Qwen2.5 500M, in addition to much larger models, such as Llama 3.2 1B and EuroLLM 1.7B.

| Rank | Model | Elo |
|---|---|---|
| 1 | Qwen2.5-Instruct-7B | 1294.6 |
| 2 | Llama-3.2-Instruct-8B | 1269.8 |
| 3 | Pleias-nano-1.2B-RAG | 1137.5 |
| 4 | Llama-3.2-Instruct-3B | 1118.1 |
| 5 | Qwen2.5-Instruct-3B | 1078.1 |
| 6 | Pleias-pico-350M-RAG | 1051.2 |
| 7 | Llama-3.2-1B-Instruct | 872.3 |
| 8 | EuroLLM-1.7B-Instruct | 860.0 |
| 9 | SmolLM-360M-Instruct | 728.6 |
| 10 | Qwen2.5-0.5B-Instruct | 722.2 |
| 11 | SmolLM-1.7B-Instruct | 706.3 |
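
For readers unfamiliar with Elo-based tournaments, here is a minimal sketch of how pairwise judge verdicts turn into ratings. It uses the standard Elo update; the K-factor of 32, the starting rating of 1000, and the match format are assumptions for illustration, since the exact tournament settings are not stated here. In practice, each `(winner, loser)` pair would come from the judge comparing two models' answers to the same query.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_tournament(
    matches: Iterable[Tuple[str, str]],
    k: float = 32.0,          # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating
) -> Dict[str, float]:
    """Apply Elo updates for a sequence of (winner, loser) verdicts."""
    ratings: Dict[str, float] = {}
    for winner, loser in matches:
        r_winner = ratings.setdefault(winner, initial)
        r_loser = ratings.setdefault(loser, initial)
        # The winner gains more when the win was less expected.
        delta = k * (1.0 - expected_score(r_winner, r_loser))
        ratings[winner] = r_winner + delta
        ratings[loser] = r_loser - delta
    return ratings
```

Sequential Elo updates depend on match order, so a common precaution is to shuffle the matches and average the ratings over several passes.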

Strong Multilingual Performance

A key concern was the tendency of multilingual models to switch to English in the middle of a generation in another language. We evaluated how well various models refrain from switching languages while generating text in a variety of EU languages, notably French, German, Dutch, Portuguese, and Polish. We release our evaluation script. We find that Pleias models outperform other leading open models. Performance is especially impressive in the 350M parameter weight class, where our model is only slightly behind Pleias 3B. All three Pleias models outperform every other model we tested. We attribute some of this to our custom tokenizer.

| Model | Prop. Language Adherence (↑) |
|---|---|
| Pleias 350M | 89.8% |
| SmolLM 360M | 65.6% |
| Pleias 1.2B | 90.4% |
| EuroLLM 1.7B | 86.9% |
| Pleias 3B | 90.7% |
| SmolLM 2B | 70% |
| Llama-3.2 3B | 71.1% |
| Qwen-2.5 3B | 82.3% |
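
One way to measure language adherence is to split each generation into chunks and check that every chunk is detected as the target language. The sketch below does this; it is an assumption about how such a metric could be computed, not a copy of the released evaluation script. The `detect` argument stands for any language-identification function that returns a language code, and the 20-word chunk size is arbitrary.

```python
from typing import Callable, Iterable, List, Tuple


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def adheres(
    text: str, target: str, detect: Callable[[str], str], chunk_words: int = 20
) -> bool:
    """True if every chunk of the generation is detected as the target language."""
    parts = split_into_chunks(text, chunk_words)
    return bool(parts) and all(detect(part) == target for part in parts)


def language_adherence(
    samples: Iterable[Tuple[str, str]],
    detect: Callable[[str], str],
    chunk_words: int = 20,
) -> float:
    """Proportion of (target_language, generation) pairs that never switch language."""
    items = list(samples)
    if not items:
        raise ValueError("at least one sample is required")
    kept = sum(adheres(text, target, detect, chunk_words) for target, text in items)
    return kept / len(items)
```

Checking chunks rather than the whole text matters here: a generation that starts in French and drifts into English halfway through may still be classified as French overall.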

Mitigating Toxicity

Toxic generations are also a concern, particularly in terms of compliance. The first draft of the Codes of Practice for the EU AI Act highlights this as a key aspect of evaluation that is essential for compliance. As many benchmarks focus on evaluating safety-tuned models, we developed our own benchmark to address key areas of concern in a way that fairly evaluates our pretrained-only models. We developed a set of prompts designed to elicit toxic generations. We targeted not only bias and stereotypes, but also other kinds of harmful content, especially content related to violent and sexually explicit themes. We compare the proportion of prompts that produce toxic completions. All generations were manually annotated by experts using concrete evaluation criteria. Annotation was blind: annotators did not know which model had produced the generations they were annotating. We are preparing a full paper to detail this procedure, in which we will release the full benchmark. Because the annotations were done manually, we could compare fewer models than in our other evaluations.

| Model | Pct. Toxic Generations (↓) |
|---|---|
| Pleias 350M | 22.9% |
| SmolLM 360M | 37.4% |
| Pleias 1.2B | 32.4% |
| Olmo 1B | 41.4% |

To supplement these results with existing bias benchmarks, we show the results for the CrowS-Pairs benchmark in English and French. Our two smallest models outperform competitive models in their weight class, with the exception of the pico (350M) model for French, where the performance is very similar.

| Model | Prop. Biased Generations, English (↓) | Prop. Biased Generations, French (↓) |
|---|---|---|
| Pleias 350M | 0.497 (±0.012) | 0.428 (±0.012) |
| SmolLM 360M | 0.562 (±0.012) | 0.399 (±0.012) |
| Pleias 1.2B | 0.413 (±0.012) | 0.421 (±0.012) |
| Llama 3.2 1B | 0.624 (±0.012) | 0.481 (±0.012) |
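
To make these numbers easier to interpret, the sketch below shows one common way a CrowS-Pairs style score is computed: for each sentence pair, the model counts as biased if it assigns a higher log-likelihood to the stereotypical sentence than to its counterpart. This assumes the standard paired-likelihood protocol and assumes the ± values are standard errors of a proportion; neither detail is confirmed here. The function name and input format are hypothetical.

```python
import math
from typing import Iterable, Tuple


def stereotype_rate(
    pair_logprobs: Iterable[Tuple[float, float]],
) -> Tuple[float, float]:
    """Return (proportion, standard error) of pairs preferring the stereotype.

    Each item is (log-likelihood of the stereotypical sentence,
    log-likelihood of the less stereotypical sentence).
    """
    pairs = list(pair_logprobs)
    if not pairs:
        raise ValueError("at least one sentence pair is required")
    n = len(pairs)
    biased = sum(1 for stereo, anti in pairs if stereo > anti)
    proportion = biased / n
    # Binomial standard error of a proportion.
    stderr = math.sqrt(proportion * (1.0 - proportion) / n)
    return proportion, stderr
```

Under this reading, a score of 0.5 means the model shows no systematic preference for either sentence in a pair, and lower values indicate that the stereotypical sentence is preferred less often.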

Demo

Our RAG model is available to use through our new application, ScholasticAI, which is open source and runs our pico (350M) model locally on your computer.

Use our models

The whole family of models is available for use through HuggingFace. We release the Pleias 1.0 models under a permissive Apache 2.0 license, meaning that models are available for use, distribution, and modification for any purpose.

In addition to our models, we have released our training corpus, training pipeline, and critical evaluations, and we intend to release all further datasets and evaluations in the coming weeks.
