FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?
Over the past few months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.
Today, I'm happy to share the first classifier trained on this data.
What we've built:
- A lightweight classifier that efficiently removes low-quality content (a minimal training sketch follows this list)
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute
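To make the recipe concrete, here is a minimal sketch of the general approach: embed the community-annotated texts with a small multilingual encoder and fit a linear classifier on top. This is not the exact pipeline behind the released classifier; the config name, column names, and binary label mapping below are assumptions, so check the dataset card for the actual schema.

```python
# A minimal sketch, assuming: the "dan_Latn" config name, a "text" column,
# and a binary "label" column (FineWeb-C's raw educational-value annotations
# may need to be collapsed into keep/remove first).
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

ds = load_dataset("data-is-better-together/fineweb-c", "dan_Latn", split="train")

# Frozen multilingual embeddings keep the trainable part tiny.
encoder = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
X = encoder.encode(ds["text"], show_progress_bar=True)
y = ds["label"]  # assumed binary: 1 = acceptable quality, 0 = low quality

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A linear head over frozen embeddings trains in seconds on CPU.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"held-out precision: {precision_score(y_test, clf.predict(X_test)):.2f}")
```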
Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.
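And a sketch of what "minimal compute" can look like at filtering time, reusing `encoder` and `clf` from the training sketch above. The FineWeb2 hub path and config name here are assumptions; verify both on the Hub before running. Streaming avoids downloading the full dump before scoring.

```python
# A sketch of batch scoring, reusing `encoder` and `clf` from the training
# sketch above. The hub path "HuggingFaceFW/fineweb-2" and the "dan_Latn"
# config are assumptions.
from datasets import load_dataset

fw2 = load_dataset(
    "HuggingFaceFW/fineweb-2", "dan_Latn", split="train", streaming=True
)

def keep_mask(texts, threshold=0.5):
    """Boolean mask: True for documents the classifier rates as acceptable."""
    return clf.predict_proba(encoder.encode(texts))[:, 1] >= threshold

batch = []
for doc in fw2:
    batch.append(doc["text"])
    if len(batch) == 512:
        mask = keep_mask(batch)
        # write out [t for t, m in zip(batch, mask) if m] to the filtered set
        batch = []

if batch:  # flush the final partial batch
    mask = keep_mask(batch)
```

Because the classifier head is linear, the cost is dominated by the embedding forward pass, which is why a single modest GPU (or even a CPU) can plausibly work through tens of millions of documents.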