FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?
Over the past few months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.
Today, I'm happy to share the first classifier trained on this data.
🔍 What we've built:
- A lightweight classifier that efficiently removes low-quality content (usage sketch below)
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute
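To make the filtering step concrete, here's a minimal sketch of what running such a classifier over documents could look like. The model id and label names are placeholders for illustration, not the actual released classifier:

```python
# Sketch: filtering documents with a text-quality classifier.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="danish-quality-classifier",  # hypothetical model id, not the real one
)

documents = [
    "En velskrevet artikel om dansk historie ...",
    "KLIK HER!!! billige priser tilbud tilbud tilbud",
]

for doc in documents:
    result = classifier(doc, truncation=True)[0]
    # Keep only documents the classifier does not flag as low quality
    # ("low_quality" is an assumed label name).
    if result["label"] != "low_quality":
        print("keep:", doc[:40])
```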
🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). By starting with community annotations and training small, efficient classifiers, we can improve training data quality at scale without massive compute resources.
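As a rough illustration of that recipe, here's how one might load the FineWeb-C annotations for a language and train a small baseline classifier. The config id, column names, and the TF-IDF + logistic regression model are all assumptions standing in for the actual training setup:

```python
# Sketch: training a lightweight quality classifier from FineWeb-C annotations.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# "dan_Latn" (Danish) and the "text"/"label" columns are assumed names;
# check the dataset card for the real configs and schema.
ds = load_dataset("data-is-better-together/fineweb-c", "dan_Latn", split="train")

texts = ds["text"]
# Binarize: treat anything above the lowest quality rating as "keep".
labels = [0 if lab == "None" else 1 for lab in ds["label"]]

# A small, fast model: TF-IDF features + logistic regression stands in here
# for whatever lightweight classifier was actually trained.
model = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["En informativ tekst om vedvarende energi."]))
```

A classifier this small runs on CPU, which is what makes scanning tens of millions of documents feasible without large compute budgets.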