Kenneth C. Enevoldsen

KennethEnevoldsen

AI & ML interests

NLP, multimodal learning, Scandinavian NLP, Theory of Mind, Medical NLP, Psychiatry

Recent Activity

new activity 1 day ago

danish-foundation-models/danish-dynaword:Add datasets quality metrics

reacted to davanstrien's post with 🤗 1 day ago

Introducing scandi-fine-web-cleaner https://huggingface.co/davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative. Today, I'm happy to share the first classifier trained on this data.

🔍 What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co/datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html

reacted to davanstrien's post with 🔥 1 day ago

Papers 7

arxiv:2406.13469

arxiv:2406.09556

arxiv:2406.02396

arxiv:2402.18209

Models 12

KennethEnevoldsen/dfm-sentence-encoder-large

Feature Extraction • Updated Nov 27, 2024 • 71 • 1

KennethEnevoldsen/munin_mistral-7b

Text Generation • Updated Mar 18, 2024 • 17 • 1

KennethEnevoldsen/munin-e5

Feature Extraction • Updated Feb 29, 2024 • 2 • 1

KennethEnevoldsen/munin-7b-e5

Updated Feb 27, 2024 • 3

KennethEnevoldsen/munin-neuralbeagle-7b-e5

Updated Feb 27, 2024 • 2

KennethEnevoldsen/dacy-large-encoder

Feature Extraction • Updated Jan 31, 2024 • 4

KennethEnevoldsen/dfm-sentence-encoder-large-exp2-no-lang-align

Sentence Similarity • Updated Nov 15, 2023 • 5.7k • 1

KennethEnevoldsen/dfm-sentence-encoder-small

Sentence Similarity • Updated Nov 15, 2023 • 3

KennethEnevoldsen/dfm-sentence-encoder-medium-v1

Sentence Similarity • Updated Nov 15, 2023 • 11

KennethEnevoldsen/dfm-sentence-encoder-large-exp1

Sentence Similarity • Updated Nov 15, 2023 • 8

Datasets 4

KennethEnevoldsen/danish-compounds

Viewer • Updated 1 day ago • 624 • 11 • 1

KennethEnevoldsen/spontanous-speech-qa

Viewer • Updated Oct 24, 2023 • 641 • 54

KennethEnevoldsen/dfm-paragraphs

Viewer • Updated Jul 13, 2023 • 7.8M • 39

KennethEnevoldsen/dane_plus

Viewer • Updated Jun 21, 2023 • 5.51k • 38 • 2