Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

AI & ML interests

Machine Learning Librarian

Recent Activity

liked a model about 1 hour ago

internlm/internlm3-8b-instruct

updated a dataset about 4 hours ago

data-is-better-together/fineweb-c-progress

updated a dataset about 9 hours ago

librarian-bots/dataset_cards_with_metadata

Articles

FineWeb2-C: Help Build Better Language Models in Your Language

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Let’s make a generation of amazing image generation models

Share your open ML datasets on Hugging Face Hub!

Scaling AI-based Data Processing with Hugging Face + Dask

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Data Is Better Together: A Look Back and Forward

Synthetic dataset generation techniques: generating custom sentence similarity data

Synthetic dataset generation techniques: Self-Instruct

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Introducing BERTopic Integration with Hugging Face Hub

Jupyter X Hugging Face

Image search with 🤗 datasets

davanstrien's activity

upvoted a paper 1 day ago

BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Paper • 2501.07171 • Published 2 days ago • 36

upvoted a collection 5 days ago

HistBERTurk-Models

Fine-tuned BERTurk models for historical Turkish. • 3 items • Updated 10 days ago • 2

upvoted 2 papers 5 days ago

Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

Paper • 2501.04828 • Published 7 days ago • 6

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

Paper • 2501.05040 • Published 6 days ago • 11

upvoted a paper 6 days ago

BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

Paper • 2501.03403 • Published 9 days ago • 4

upvoted 2 articles 8 days ago

Synthetic Data Generation with FastData and Hugging Face

Article • 8 days ago • 13

Crowd-sourced Open Preference Dataset for Text-to-Image Generation

Article • 8 days ago • 17

upvoted a collection 9 days ago

METAGENE-1

METAGENE-1 Models • 5 items • Updated 7 days ago • 5

upvoted a paper 9 days ago

CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions

Paper • 2501.00097 • Published 16 days ago • 1

upvoted 2 collections 19 days ago

🥂 FineWeb2

3 items • Updated Dec 8, 2024 • 11

QVQ

QVQ: Qwen models for visual reasoning • 7 items • Updated 14 days ago • 40

upvoted an article 23 days ago

FineWeb2-C: Help Build Better Language Models in Your Language

Article • 23 days ago • 12

upvoted a paper 26 days ago

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published 28 days ago • 123

upvoted 2 collections 27 days ago

Granite 3.1 Language Models

A series of language models with 128K context length trained by IBM licensed under Apache 2.0 license. • 8 items • Updated 28 days ago • 48

ModernBERT

Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated 27 days ago • 123

upvoted a collection 28 days ago

Hf-native ColVision Models

Models that can be used with the native transformers 🤗 implementation instead of colpali-engine. • 2 items • Updated Dec 8, 2024 • 2

upvoted a paper 29 days ago

OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages

Paper • 2412.09587 • Published Dec 12, 2024 • 3

upvoted 2 collections 29 days ago

Sailor2 Post-training Datasets

3 items • Updated Dec 3, 2024 • 5

PaliGemma 2 Release

Vision-Language Models available in multiple 3B, 10B and 28B variants. • 23 items • Updated Dec 13, 2024 • 126

upvoted a collection about 1 month ago

FineWeb2 Collaborative Annotation Sprint

5 items • Updated 22 days ago • 6