Dan Saattrup Nielsen

saattrupdan

AI & ML interests

NLP for low-resource languages.

Recent Activity

updated a dataset about 5 hours ago
alexandrainst/coral
liked a model about 22 hours ago
OpenLLM-France/Lucie-7B-Instruct
liked a model about 22 hours ago
OpenLLM-France/Lucie-7B

Organizations

Flax Community, Dansk Data Science Community, DaNLP, AI Sweden Model Hub, north, Blackbird AI, ScandEval, Alexandra Institute, Job Ad Generator, LumiOpen, Danish Foundation Models, CoRal, Merge Crew, RAG Demo, TrustLLM EU

saattrupdan's activity

reacted to davanstrien's post with 🔥 2 days ago
Introducing scandi-fine-web-cleaner (davanstrien/scandi-fine-web-cleaner), the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.

Today, I'm happy to share the first classifier trained on this data.

What we've built:

- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.
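As a rough illustration of the filtering idea described in the post, here is a minimal sketch in plain Python. It is not the code from the blog post or the scandi-fine-web-cleaner model: `toy_score` is a hypothetical stand-in for a trained classifier (it only measures word repetition), and the function names and the 0.5 threshold are assumptions made for the example.

```python
from typing import Callable, List, Sequence


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a trained quality classifier.

    Returns a score in [0, 1], where higher means "more likely low quality".
    Here the score is simply the fraction of repeated words.
    """
    words = text.split()
    if not words:
        return 1.0  # treat empty documents as low quality
    return 1.0 - len(set(words)) / len(words)


def filter_documents(
    docs: Sequence[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, to be tuned on annotated data
) -> List[str]:
    """Keep only documents scored below the low-quality threshold."""
    return [doc for doc in docs if score_fn(doc) < threshold]


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Precision of "low quality" predictions against annotator labels."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


docs = ["buy buy buy buy", "Danish is a North Germanic language"]
kept = filter_documents(docs, toy_score)
```

In a real setup, `score_fn` would wrap a small classifier trained on the community annotations, and the threshold would be chosen by checking precision on held-out annotated documents.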

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
reacted to davanstrien's post with 🤗 2 days ago
New activity in allenai/OLMo-2-1124-7B 7 days ago
New activity in allenai/OLMo-2-1124-13B 7 days ago
New activity in BSC-LT/salamandraTA-2B 8 days ago
New activity in NbAiLab/nb-llama-3.1-70B 20 days ago

Change dtype to bf16

#1 opened 20 days ago by saattrupdan