Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation Jun 20, 2024 • 12
Synthetic dataset generation techniques: generating custom sentence similarity data May 23, 2024 • 16
Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia? May 7, 2024 • 7
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 72
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 28
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Paper • 2501.07171 • Published 2 days ago • 36
HistBERTurk-Models Collection Fine-tuned BERTurk models for historical Turkish. • 3 items • Updated 10 days ago • 2
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models Paper • 2501.04828 • Published 7 days ago • 6
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution Paper • 2501.05040 • Published 6 days ago • 11
BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations Paper • 2501.03403 • Published 9 days ago • 4
Article Synthetic Data Generation with FastData and Hugging Face By asoria • 8 days ago • 13
Article Crowd-sourced Open Preference Dataset for Text-to-Image Generation By RapidataAI • 8 days ago • 17
CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions Paper • 2501.00097 • Published 16 days ago • 1
Article FineWeb2-C: Help Build Better Language Models in Your Language By davanstrien • 23 days ago • 12
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Paper • 2412.13663 • Published 28 days ago • 123
Granite 3.1 Language Models Collection A series of language models with 128K context length trained by IBM licensed under Apache 2.0 license. • 8 items • Updated 28 days ago • 48
ModernBERT Collection Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated 27 days ago • 123
Hf-native ColVision Models Collection Models that can be used with the native transformers 🤗 implementation instead of colpali-engine. • 2 items • Updated Dec 8, 2024 • 2
OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages Paper • 2412.09587 • Published Dec 12, 2024 • 3
PaliGemma 2 Release Collection Vision-Language Models available in multiple 3B, 10B and 28B variants. • 23 items • Updated Dec 13, 2024 • 126