Daniel van Strien PRO
davanstrien
AI & ML interests
Machine Learning Librarian
Recent Activity
liked
a model
38 minutes ago
internlm/internlm3-8b-instruct
updated
a dataset
about 4 hours ago
data-is-better-together/fineweb-c-progress
updated
a dataset
about 8 hours ago
librarian-bots/dataset_cards_with_metadata
davanstrien's activity
reacted to
AdinaY's
post with 🔥
about 21 hours ago
Post
1195
MiniCPM-o2.6 🔥 an end-side multimodal LLM released by OpenBMB from the Chinese community
Model: openbmb/MiniCPM-o-2_6
✨ Real-time English/Chinese conversation, emotion control and ASR/STT
✨ Real-time video/audio understanding
✨ Processes up to 1.8M pixels, leads OCRBench & supports 30+ languages
replied to
their
post
1 day ago
Model wouldn't be possible without @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod for Swedish and @rasgaard @JakobBlaa @saattrupdan @FrLars21 @markhougaard @KennethEnevoldsen @Apasalic @tqvist @cnila @Soeren-B @KristianL @mathiasn1 @ITK-dev @jannikskytt @AndreasLH @perlausten @sorenmulli @organicoder for Danish!
posted
an
update
1 day ago
Post
2351
Introducing scandi-fine-web-cleaner
davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!
FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?
Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.
Today, I'm happy to share the first classifier trained on this data.
What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute
Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.
Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
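As a rough illustration of the filtering idea in this post (not the code from the blog post), here is a minimal Python sketch. It assumes a classifier that returns the probability that a document is low quality; `toy_score`, the threshold of 0.5, and the sample documents are all hypothetical stand-ins, and a trained model such as davanstrien/scandi-fine-web-cleaner would take the place of `toy_score`.

```python
from typing import Callable, Iterable, List, Tuple


def filter_documents(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Split docs into (kept, removed) using a low-quality score."""
    kept: List[str] = []
    removed: List[str] = []
    for doc in docs:
        # score_fn returns the estimated probability that doc is low quality
        if score_fn(doc) >= threshold:
            removed.append(doc)
        else:
            kept.append(doc)
    return kept, removed


def precision(predicted: List[bool], actual: List[bool]) -> float:
    """Share of documents flagged as low quality that really are low quality."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    flagged = sum(1 for p in predicted if p)
    return true_pos / flagged if flagged else 0.0


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for a trained classifier.

    Penalises very short or link-heavy text; a real model replaces this.
    """
    words = doc.split()
    if len(words) < 4:
        return 0.9
    link_ratio = sum(w.startswith("http") for w in words) / len(words)
    return min(1.0, link_ratio * 2)


docs = [
    "Klik her",
    "København er Danmarks hovedstad og landets største by.",
    "http://a.example http://b.example http://c.example køb nu",
]
kept, removed = filter_documents(docs, toy_score, threshold=0.5)
print(len(kept), "kept,", len(removed), "removed")
```

Raising the threshold trades recall for precision: fewer documents are removed, but those that are removed are more likely to be genuinely low quality.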
posted
an
update
5 days ago
Post
2007
The data-is-better-together/fineweb-c dataset is growing!
This week a few more languages have got 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.
Why should you care?
The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (HuggingFaceFW/blogpost-fineweb-v1).
Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Importantly, this approach can also reduce the amount of data needed for pretraining.
Why not use an LLM?
LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.
The community is annotating the educational quality of fineweb2 data. Currently, 114 languages have some annotations. These annotations will enable a number of things:
- Evaluating whether an LLM can label the educational quality of texts in that language well
- Training quality classifiers directly
- Helping discover other rules and heuristics for refining fineweb2 further for different languages
This week the following languages were completed:
Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod
Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate
Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap
Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community
Contribute yourself here: data-is-better-together/fineweb-c
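One way to approach the first use case above (checking whether an LLM labels educational quality well for a language) is to measure its agreement with the community annotations. The sketch below uses Cohen's kappa, which is my choice of metric rather than anything the post prescribes; the label names and the two label lists are made-up placeholders, not real FineWeb-C data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of items where both annotators chose the same label
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected if both labelled at random with their own label rates
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


# Placeholder labels: community annotations vs. hypothetical LLM predictions
community = ["good", "good", "bad", "bad"]
llm = ["good", "bad", "bad", "bad"]
print(cohen_kappa(community, llm))
```

A kappa close to 1 suggests the LLM could stand in for human annotators in that language, while a value near 0 means it agrees with them no more often than chance.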
reacted to
albertvillanova's
post
8 days ago
Post
1724
Discover all the improvements in the new version of Lighteval: https://huggingface.co/docs/lighteval/
replied to
their
post
19 days ago
There are some already in the Argilla instance!
You can also join the discussions here: https://huggingface.co/spaces/HuggingFaceFW/discussion :)
replied to
their
post
19 days ago
Thanks to the hard work of @ivykopal, the first 1,000 annotations for Slovak have been completed! Make sure to give Ivan a follow :)
reacted to
nicolay-r's
post with ❤️
19 days ago
Post
2113
Delighted to share the most recent milestone on quick deployment of Named Entity Recognition (NER) in Gen-AI powered systems.
Releasing bulk-ner 0.25.0, a tiny framework that saves you time when deploying NER with any model.
Why is this important? In the era of GenAI, handling textual output can be challenging. Instead, recognizing named entities via domain-oriented systems for your downstream LLM may be the preferable option.
PyPI: https://pypi.org/project/bulk-ner/0.25.0/
GitHub: https://github.com/nicolay-r/bulk-ner
I noticed that directly adapting an LM for NER results in spending a significant amount of time on formatting your texts according to the NER model's needs.
In particular:
1. Processing CONLL format with B-I-O tags from model outputs
2. Input trimming: long input content might not fit completely
To cope with these problems, in version 0.25.0 I made a big step forward by providing:
- Python API support: see the screenshot below for a quick deployment
- No-string: dependencies are now clear, so it is a pure Python implementation for API calls.
- Simplified output formatting: we use lists to represent texts, with inner lists that refer to annotated objects (see screenshot below)
We have a Colab for a quick start here (or see the screenshot for the bash / Python API):
https://colab.research.google.com/github/nicolay-r/ner-service/blob/main/NER_annotation_service.ipynb
The code for pipeline deployment is taken from the AREkit project:
https://github.com/nicolay-r/AREkit
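To make the two formatting chores named in this post concrete, here is a small plain-Python sketch of decoding B-I-O tags into entity spans and splitting long inputs into fixed-size chunks. This is not the bulk-ner API; `bio_to_spans`, `chunk_tokens`, and the sample tokens are hypothetical names and data used only for illustration.

```python
from typing import List, Optional, Tuple

# (entity type, start token index, end token index, exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Turn a B-I-O tag sequence into entity spans.

    An I- tag that does not continue the current entity starts a new one.
    """
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and label == kind
        if not continues:
            if label is not None:
                spans.append((label, start, i))  # close the open entity
                start, label = None, None
            if prefix in ("B", "I"):
                start, label = i, kind  # open a new entity
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


def chunk_tokens(tokens: List[str], max_len: int) -> List[List[str]]:
    """Split tokens into consecutive chunks of at most max_len tokens."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


tokens = ["Daniel", "van", "Strien", "works", "at", "Hugging", "Face"]
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
for kind, begin, end in bio_to_spans(tags):
    print(kind, " ".join(tokens[begin:end]))
```

Non-overlapping chunks can cut an entity in half at a boundary, so a real pipeline would likely use overlapping windows or split on sentence boundaries instead.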
posted
an
update
19 days ago
Post
3158
🇸🇰 Hovorte po slovensky? (Do you speak Slovak?) Help build better AI for Slovak!
We only need 90 more annotations to include Slovak in the next Hugging Face FineWeb2-C dataset (data-is-better-together/fineweb-c) release!
Your contribution will help create better language models for 5+ million Slovak speakers.
Annotate here: data-is-better-together/fineweb-c.
Read more about why we're doing it: https://huggingface.co/blog/davanstrien/fineweb2-community
posted
an
update
26 days ago
Post
1765
Introducing FineWeb-C, a community-built dataset for improving language models in ALL languages.
Inspired by FineWeb-Edu, the community is labelling the educational quality of texts for many languages.
318 annotators, 32K+ annotations, 12 languages - and growing!
data-is-better-together/fineweb-c
reacted to
anton-l's
post with 🔥
27 days ago
Post
2208
Introducing FineMath: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath
Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
We build the dataset by:
- carefully extracting math data from Common Crawl;
- iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning!
We're also releasing all the ablation models as well as the evaluation code.
HuggingFaceTB/finemath-6763fb8f71b6439b653482c2
reacted to
stefan-it's
post with ❤️
about 1 month ago
Post
1259
My latest project is the outcome of the last 2+ years working with TPUs from the amazing TPU Research Cloud (TRC) program and training Encoder-only LMs with the TensorFlow Model Garden library.
Link: https://github.com/stefan-it/model-garden-lms
An overview of some features:
- Cheatsheet for setting up a TPU VM Pod (with all necessary dependencies) to pretrain LMs with TF Model Garden
- Conversion scripts that convert TF Model Garden weights to Hugging Face Transformers-compatible models
- Supported architectures include BERT, BERT with Token Dropping and TEAMS
I also released BERT-based models pretrained on the great Hugging Face FineWeb and FineWeb-Edu datasets (10BT subset). With more to come!
Model Hub Link: https://huggingface.co/model-garden-lms
If you find these resources useful, please give them a like!
Made from Bavarian Oberland with ❤️ and 🥨.
reacted to
davidberenstein1957's
post with 🔥
about 1 month ago
Post
2075
Open Preference Dataset for Text-to-Image Generation by the 🤗 Community
Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation. This dataset contains 10K text-to-image preference pairs across common image generation categories, while using different model families and varying prompt complexities.
https://huggingface.co/blog/image-preferences
reacted to
thomwolf's
post
about 1 month ago
Post
4793
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of languages.
We applied the same data-driven approach that led to SOTA English performance in FineWeb to thousands of languages.
FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive ODC-By 1.0 license, and the code to reproduce it and our evaluations is public.
We will very soon announce a big community project, and are working on a blogpost walking you through the entire dataset creation process. Stay tuned!
In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion
H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
posted
an
update
about 2 months ago
Post
512
Increasingly, LLMs are becoming very useful for helping scale annotation tasks, i.e. labelling and filtering. When combined with structured generation, this can be a very scalable way of doing some pre-annotation without requiring a large team of human annotators.
However, there are quite a few cases where it still doesn't work well. This is a nice paper looking at the limitations of LLM as an annotator for Low Resource Languages: On Limitations of LLM as Annotator for Low Resource Languages (2411.17637).
Humans will still have an important role in the loop to help improve models for all languages (and domains).
reacted to
andito's
post with 🔥
about 2 months ago
Post
1837
SmolVLM speeding locally on a laptop thanks to mlx-vlm and @Gradio! Try it with two lines:
pip install git+https://github.com/andimarafioti/mlx-vlm.git@stream-generate-fix
python -m mlx_vlm.chat_ui --model mlx-community/SmolVLM-Instruct-8bit
Gotta love the MLX community! Big thanks to @pcuenq and @prince_canuma!
reacted to
MohamedRashad's
post
about 2 months ago
Post
1647
A while back I shared this model, MohamedRashad/arabic-small-nougat, which was a finetune of facebook/nougat-small for the Arabic language.
Today this humble project has been scaled with new models, new datasets, a new space, and a new paper.
Check everything through this collection here:
MohamedRashad/arabic-nougat-673a3f540bd92904c9b92a8e
reacted to
AdinaY's
post
about 2 months ago
Post
1122
Zhipu AI, the Chinese generative AI startup behind CogVideo, just launched their first productized AI Agent - AutoGLM 🔥
https://agent.aminer.cn
With simple text or voice commands, it:
✨ Simulates phone operations effortlessly
✨ Autonomously handles 50+ step tasks
✨ Seamlessly operates across apps
Powered by Zhipu's "Decoupled Interface" and "Self-Evolving Learning Framework" to achieve major performance gains in Phone Use and Web Browser Use!
Meanwhile, GLM4-Edge is now on Hugging Face hub
THUDM/glm-edge-6743283c5809de4a7b9e0b8b
Packed with advanced dialogue + multimodal models:
📱 1.5B / 2B models: Built for mobile & in-car systems
💻 4B / 5B models: Optimized for PCs