Elie Bakouch

eliebak

AI & ML interests

Training LLMs @ 🤗

Recent Activity

liked a model about 11 hours ago
MiniMaxAI/MiniMax-Text-01
updated a model 1 day ago
kyutai/helium-1-preview-2b

Organizations

Hugging Face, HuggingFaceBR4, Hugging Face H4, Blog-explorers, Hugging Face TB Research, huggingPartyParis, Nanotron Research, MLX Community, Hugging Face SMOL, HuggingFaceFW, LLHF, llmc, SLLHF, Argilla Warehouse, nltpt, smol-explorers, Open Science, Hugging Face Science, open/ acc

eliebak's activity

New activity in kyutai/helium-1-preview-2b 1 day ago
fix title (#2, opened 1 day ago by eliebak)
updated a Space 15 days ago
New activity in reach-vb/2024-ai-timeline 15 days ago
Update index.html (#5, opened 15 days ago by eliebak)
liked a Space 16 days ago
updated a Space 17 days ago
reacted to Kseniase's post with 🔥 17 days ago
10 Free Comprehensive Datasets for Supervised Fine-Tuning

The quality, size, and relevance of a dataset directly impact the effectiveness of fine-tuning and the model's real-world performance. Among the numerous datasets available for different tasks, it can be challenging to pick the one that best suits your purposes.

So today, we invite you to explore the top 10 free datasets for natural language processing and math:

1. fka/awesome-chatgpt-prompts offers a huge variety of prompts for use with ChatGPT. Over 700 models have been trained on this dataset.

2. HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It's suitable for LLM training, benchmarking, and model validation (see the streaming sketch after this list).

3. HuggingFaceFW/fineweb-2 is another version of FineWeb, extending high-quality pretraining data to over 1,000 languages.

4. O1-OPEN/OpenO1-SFT, with Chinese and English data, can be used for Chain-of-Thought activation.

5. yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.

6. lmsys/lmsys-chat-1m contains 1 million real-world conversations with 25 state-of-the-art LLMs, supporting diverse use cases such as content moderation, safety benchmarks, and training instruction-following models.

7. allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
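Web-scale corpora like FineWeb and Dolma are far too large to download outright, so streaming is the practical way to inspect them. Here is a minimal sketch using the `datasets` library; the `sample-10BT` config name is an assumption taken from the FineWeb repo and should be checked against the current dataset card:

```python
from datasets import load_dataset

# Stream FineWeb instead of downloading all 15T tokens.
# "sample-10BT" is a smaller sample config; verify the exact
# config name against the current dataset card.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Iterate lazily over the first few documents.
for i, doc in enumerate(fineweb):
    print(doc["text"][:200])
    if i == 2:
        break
```

With `streaming=True`, `load_dataset` returns an iterable that fetches shards on the fly, so nothing is written to disk up front.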

Math datasets:

1. HuggingFaceTB/finemath consists of educational math content and has two versions: 34B tokens and 54B tokens.

2. amphora/QwQ-LongCoT-130K is designed for training O1-like LLMs.

3. openai/gsm8k is used for training multi-step mathematical reasoning (a loading sketch follows below).
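The supervised fine-tuning sets above are small enough for a plain, non-streaming load. A minimal sketch for GSM8K, assuming the standard `main` config and the `question`/`answer` columns listed on the dataset card:

```python
from datasets import load_dataset

# GSM8K ships with "main" and "socratic" configs; "main" holds
# plain question/answer pairs for multi-step reasoning training.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")

example = gsm8k[0]
print(example["question"])  # the word problem
print(example["answer"])    # step-by-step solution, ending in "#### <result>"
```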