🤗👤 💻 Speaking of AI agents ... ...Is easier with the right words ;)
My colleagues @meg@evijit@sasha and @giadap just published a wonderful blog post outlining some of the main relevant notions with their signature blend of value-informed and risk-benefits contrasting approach. Go have a read!
FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?
Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.
Today, I'm happy to share the first classifier trained on this data.
🔍 What we've built:
- A lightweight classifier that efficiently removes low-quality content - 90%+ precision demonstrated on Danish & Swedish - Can process the 43M+ documents in Danish FineWeb2 with minimal compute
🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.