Knut Jägersberg

KnutJaegersberg

AI & ML interests

NLP, opinion mining, narrative intelligence

Recent Activity

liked a model about 8 hours ago
internlm/internlm3-8b-instruct
liked a model about 14 hours ago
lightonai/modernbert-embed-large
liked a model 1 day ago
openbmb/MiniCPM-o-2_6

Organizations

LLMs, Blog-explorers, Qwen, Social Post Explorers, M4-ai, Chinese LLMs on Hugging Face, Smol Community

KnutJaegersberg's activity

reacted to prithivMLmods's post with 🔥 2 days ago
Reasoning SmolLM2 🚀

🎯Fine-tuning SmolLM2 on a lightweight synthetic reasoning dataset for reasoning-specific tasks. Future updates will focus on lightweight, blazing-fast reasoning models. Until then, check out the blog for fine-tuning details.

🔥Blog : https://huggingface.co/blog/prithivMLmods/smollm2-ft

🔼 Models :
+ SmolLM2-CoT-360M : prithivMLmods/SmolLM2-CoT-360M
+ Reasoning-SmolLM2-135M : prithivMLmods/Reasoning-SmolLM2-135M
+ SmolLM2-CoT-360M-GGUF : prithivMLmods/SmolLM2-CoT-360M-GGUF

🤠 Other Details :
+ Demo : prithivMLmods/SmolLM2-CoT-360M
+ Fine-tuning notebook : prithivMLmods/SmolLM2-CoT-360M




reacted to davanstrien's post with 🔥 2 days ago
Introducing scandi-fine-web-cleaner (davanstrien/scandi-fine-web-cleaner), the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.

Today, I'm happy to share the first classifier trained on this data.

🔍 What we've built:

- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
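
To make the "90%+ precision" claim concrete, here is a minimal sketch of how precision for a quality filter could be measured and how a score threshold could be applied. It is not the code from the blog post: the function names, the 0.5 threshold, and the toy scores and labels are all hypothetical, and it assumes a classifier that outputs a "low quality" probability per document.

```python
from typing import List


def precision_at_threshold(scores: List[float], labels: List[int], threshold: float) -> float:
    """Share of flagged documents that annotators also marked as low quality.

    scores: hypothetical classifier probabilities that a document is low quality.
    labels: 1 if annotators marked the document as low quality, else 0.
    """
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    if not flagged:
        return 0.0  # nothing was flagged, so precision is undefined; report 0.0 here
    return sum(flagged) / len(flagged)


def filter_documents(docs: List[str], scores: List[float], threshold: float) -> List[str]:
    """Keep only documents whose low-quality score is below the threshold."""
    return [doc for doc, score in zip(docs, scores) if score < threshold]


# Toy, made-up data for illustration only.
docs = ["spam a", "good b", "spam c", "good d", "good e"]
scores = [0.95, 0.10, 0.80, 0.30, 0.99]
labels = [1, 0, 1, 0, 0]

precision = precision_at_threshold(scores, labels, threshold=0.5)
kept = filter_documents(docs, scores, threshold=0.5)
```

High precision matters for this use case because every false positive is a good document thrown away; raising the threshold usually trades some recall for fewer wrongly removed documents.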
reacted to merve's post with ❤️ 2 days ago
there's a new multimodal retrieval model in town 🤠
LlamaIndex released vdr-2b-multi-v1
> uses 70% fewer image tokens, yet outperforms other dse-qwen2 based models
> 3x faster inference with less VRAM 💨
> shrinkable with matryoshka 🪆
> can do cross-lingual retrieval!
Collection: llamaindex/visual-document-retrieval-678151d19d2758f78ce910e1 (with models and datasets)
Demo: llamaindex/multimodal_vdr_demo
Learn more from their blog post here https://huggingface.co/blog/vdr-2b-multilingual 📖
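
For readers unfamiliar with "shrinkable with matryoshka": Matryoshka-style embeddings are trained so that a prefix of the vector is still a usable embedding. A minimal sketch of the general idea follows, assuming a plain list of floats as the embedding. The helper names are hypothetical and this is not the vdr-2b-multi-v1 API; check the model card for the dimensions it actually supports.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and L2-normalize the result."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # avoid division by zero for an all-zero prefix
    return [x / norm for x in head]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two unit-normalized vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-dimensional vectors, shrunk to 2 dimensions for illustration only.
query = truncate_embedding([3.0, 4.0, 12.0], dim=2)
document = truncate_embedding([6.0, 8.0, -1.0], dim=2)
similarity = cosine_similarity(query, document)
```

Shorter vectors reduce index size and speed up search, usually at some cost in retrieval quality, so the truncation length is worth validating on your own data.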
posted an update 2 days ago