arxiv:2305.17154
rasgaard
AI & ML interests
None yet
Recent Activity
reacted to davanstrien's post with 🤗 1 day ago
Introducing scandi-fine-web-cleaner (https://huggingface.co/davanstrien/scandi-fine-web-cleaner), the first model trained on FineWeb-C community annotations!
FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?
Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.
Today, I'm happy to share the first classifier trained on this data.
🔍 What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute
🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co/datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.
Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
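The filtering idea described in the post can be sketched in a few lines. This is a minimal illustration, not the post author's implementation: `filter_documents`, `precision`, and `toy_scorer` are hypothetical names, and the repetition heuristic only stands in for a trained classifier so the example runs without downloading a model.

```python
from typing import Callable, Iterable, Iterator, Sequence


def filter_documents(
    docs: Iterable[str],
    score_low_quality: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only documents whose low-quality score is below the threshold."""
    for doc in docs:
        if score_low_quality(doc) < threshold:
            yield doc


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Share of documents flagged as low quality that annotators also flagged."""
    true_positives = sum(p and a for p, a in zip(predicted, actual))
    flagged = sum(predicted)
    return true_positives / flagged if flagged else 0.0


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in scorer: a real setup would call the trained classifier."""
    words = text.split()
    if not words:
        return 1.0
    # Illustrative heuristic only: highly repetitive text scores as low quality.
    return 1.0 - len(set(words)) / len(words)


docs = [
    "Klik her klik her klik her klik her",
    "København er Danmarks hovedstad og største by.",
]
kept = list(filter_documents(docs, toy_scorer, threshold=0.5))
```

Passing the scorer in as a callable keeps the filtering loop independent of the model, so a classifier trained on another language's annotations could be swapped in without changing the loop. The `threshold` default of 0.5 is an assumption; in practice it would be tuned against held-out annotations to reach the desired precision.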
upvoted a collection about 1 month ago
Danish Text Datasets
liked a dataset about 1 month ago
HuggingFaceFW/fineweb-2
Organizations
Papers 1
Models 12
rasgaard/luke-base-newsgroups-finetuned • Text Classification
rasgaard/luke-base-newsgroups-probe • Text Classification • 3
rasgaard/squeezebert-newsgroups-finetuned • Text Classification • 2
rasgaard/squeezebert-newsgroups-probe • Text Classification • 1
rasgaard/distilbert-newsgroups-finetuned • Text Classification
rasgaard/distilbert-newsgroups-probe • Text Classification • 1
rasgaard/bert-newsgroups-finetuned • Text Classification • 1
rasgaard/bert-newsgroups-probe • Text Classification
rasgaard/roberta-newsgroups-finetuned • Text Classification
rasgaard/roberta-newsgroups-probe • Text Classification