Joseph G Flowers
Josephgflowers
AI & ML interests
None yet
Recent Activity
upvoted a paper about 10 hours ago
Transformer^2: Self-adaptive LLMs
reacted to davanstrien's post 1 day ago
Introducing scandi-fine-web-cleaner https://huggingface.co/davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!
FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?
Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.
Today, I'm happy to share the first classifier trained on this data.
What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute
Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co/datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.
Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
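As a rough illustration of the filtering idea described in the post above, here is a minimal sketch. It is not the post's actual classifier: it assumes some scoring function has already been trained, keeps documents whose score meets a threshold, and measures the precision of "low-quality" predictions against human annotations. The names `filter_documents`, `precision`, `toy_score`, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, List, Sequence


def filter_documents(
    docs: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the post
) -> List[str]:
    """Keep documents whose quality score is at or above the threshold."""
    return [d for d in docs if score(d) >= threshold]


def precision(predicted_bad: Sequence[bool], annotated_bad: Sequence[bool]) -> float:
    """Fraction of documents flagged as low quality that annotators also flagged."""
    flagged = sum(predicted_bad)
    if flagged == 0:
        return 0.0
    true_pos = sum(1 for p, a in zip(predicted_bad, annotated_bad) if p and a)
    return true_pos / flagged


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for a trained classifier:
    the share of alphabetic or whitespace characters in the document."""
    if not doc:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in doc) / len(doc)
```

In practice `toy_score` would be replaced by a classifier trained on the community annotations; the filtering and precision logic stays the same.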
liked a dataset 1 day ago
mlabonne/smoltalk-semhashed
Organizations
Collections 1
spaces 1
models 52
Josephgflowers/TinyLlama-Cinder-Agent-v1 • Text Generation • Updated • 245 • 1
Josephgflowers/TinyLlama-3T-Cinder-v1.2 • Text Generation • Updated • 964 • 3
Josephgflowers/Phinance-Phi-3.5-mini-instruct-finance-v0.2 • Text Generation • Updated • 25 • 1
Josephgflowers/Tinyllama-STEM-Cinder-Agent-v1 • Updated • 4
Josephgflowers/Differential-Attention-Liquid-Metal-Tinyllama-Cinder • Updated • 1
Josephgflowers/Differential-Attention-Liquid-Metal-Tinyllama • Updated • 5 • 1
Josephgflowers/Liquid-Metal-Tinyllama-Test-1 • Updated • 5
Josephgflowers/Address-Parser-Tinyllama-v1 • Updated • 64
Josephgflowers/TinyLlama-v1.1-Cinders-World • Updated • 4
Josephgflowers/140M-TinyLLama-Mini-Cinder-With-GGUF • Text Generation • Updated • 76 • 1
datasets 24
Josephgflowers/Finance-Instruct-500k • Viewer • Updated • 518k • 37 • 1
Josephgflowers/Par-Four-Fineweb-Edu-Fortified • Viewer • Updated • 6.05M • 160 • 6
Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Finance • Viewer • Updated • 178k • 38
Josephgflowers/Phinance • Viewer • Updated • 166k • 28 • 1
Josephgflowers/Cinder_Char_Phi • Viewer • Updated • 42.4k • 44
Josephgflowers/Cinder_Phi • Viewer • Updated • 1.63M • 40
Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Chemistry-Physics-Astronomy-Math-Reason • Viewer • Updated • 988k • 73 • 1
Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Math • Viewer • Updated • 182k • 37
Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Logic • Viewer • Updated • 24.3k • 37
Josephgflowers/Synthia-v1.3-Implicit-Reasoning • Viewer • Updated • 26.3k • 30