arxiv:2406.13469
Kenneth C. Enevoldsen
KennethEnevoldsen
AI & ML interests
NLP, multimodal learning, Scandinavian NLP, Theory of Mind, Medical NLP, Psychiatry
Recent Activity
new activity 1 day ago
danish-foundation-models/danish-dynaword: Add datasets quality metrics
reacted to davanstrien's post with 🤗 1 day ago
Introducing scandi-fine-web-cleaner (https://huggingface.co/davanstrien/scandi-fine-web-cleaner), the first model trained on FineWeb-C community annotations!
FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?
Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.
Today, I'm happy to share the first classifier trained on this data.
What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute
Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co/datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.
Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
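The filtering step described in the post can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the scandi-fine-web-cleaner implementation: `filter_low_quality`, `toy_score`, and the 0.5 threshold are hypothetical names and values chosen for the example, and `toy_score` is a crude stand-in for a trained classifier's predicted probability that a document is low quality.

```python
from typing import Callable, Iterable, Iterator


def filter_low_quality(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield documents whose low-quality score is below `threshold`.

    `score_fn` is assumed to return a value in [0, 1], where higher
    means "more likely to be low quality".
    """
    for doc in docs:
        if score_fn(doc) < threshold:
            yield doc


def toy_score(text: str) -> float:
    """Hypothetical scorer: share of non-alphabetic, non-space characters.

    A real pipeline would call a trained classifier here instead.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # treat empty documents as low quality
    return sum(not c.isalpha() for c in chars) / len(chars)


docs = [
    "Dette er en almindelig dansk sætning.",
    "$$$ 1234 !!! >>> 5678 ###",
    "",
]
kept = list(filter_low_quality(docs, toy_score, threshold=0.5))
print(kept)
```

Because the filter is a generator that takes the scorer as an argument, it could stream over a large corpus one document at a time, and the toy scorer could be swapped for a real model without changing the loop.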
reacted to davanstrien's post with 🔥 1 day ago
models
12
KennethEnevoldsen/dfm-sentence-encoder-large • Feature Extraction • Updated • 71 • 1
KennethEnevoldsen/munin_mistral-7b • Text Generation • Updated • 17 • 1
KennethEnevoldsen/munin-e5 • Feature Extraction • Updated • 2 • 1
KennethEnevoldsen/munin-7b-e5 • Updated • 3
KennethEnevoldsen/munin-neuralbeagle-7b-e5 • Updated • 2
KennethEnevoldsen/dacy-large-encoder • Feature Extraction • Updated • 4
KennethEnevoldsen/dfm-sentence-encoder-large-exp2-no-lang-align • Sentence Similarity • Updated • 5.7k • 1
KennethEnevoldsen/dfm-sentence-encoder-small • Sentence Similarity • Updated • 3
KennethEnevoldsen/dfm-sentence-encoder-medium-v1 • Sentence Similarity • Updated • 11
KennethEnevoldsen/dfm-sentence-encoder-large-exp1 • Sentence Similarity • Updated • 8