Magnus Enbom
menbom
2 followers · 9 following
AI & ML interests
None yet
Recent Activity
reacted to davanstrien's post with 🔥 1 day ago
Introducing scandi-fine-web-cleaner (https://huggingface.co/davanstrien/scandi-fine-web-cleaner), the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative. Today, I'm happy to share the first classifier trained on this data.

What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

Why this matters:
The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co/datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
liked a model about 2 months ago: mistralai/Pixtral-Large-Instruct-2411
liked a model about 2 months ago: google/gemma-2-27b-it
Organizations
models: 3
menbom/test-setfit-model • Sentence Similarity • Updated Feb 1, 2023 • 2
menbom/donut-base-sroie • Image-Text-to-Text • Updated Sep 19, 2022 • 5
menbom/distilbert-base-uncased-finetuned-emotion • Text Classification • Updated Jun 3, 2022 • 12
datasets
None public yet