🎉 ...And we're live! 🎉 Seasonal newsletter from ethicsy folks at Hugging Face, exploring the ethics of "AI Agents" https://huggingface.co/blog/ethics-soc-7
Our analyses found:
- There's a spectrum of "agent"-ness
- *Safety* is a key issue, leading to many other value-based concerns
Read for details & what to do next! With @evijit, @giadap, and @sasha
Speaking of AI agents ... ...is easier with the right words ;)
My colleagues @meg, @evijit, @sasha, and @giadap just published a wonderful blog post outlining some of the main relevant notions with their signature blend of value-informed analysis and risk-benefit framing. Go have a read!
The paper includes a lot of experiments (they trained 84 models!) on what makes video LMs work ▶️
Try the demo for the best setup here: https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
They evaluate sampling strategies, scaling laws for models and datasets, video representations, and more!
> The authors find that design decisions made on small models also hold when the model and dataset are scaled up; scaling the dataset has diminishing returns for smaller models
> They evaluate frame-sampling strategies and find that FPS sampling beats uniform sampling, with 8-32 tokens per frame being optimal
> They also compare image encoders, trying a range of models from shape-optimized SigLIP to DINOv2, and find google/siglip-so400m-patch14-384 to be the most powerful
> They also compare freezing different parts of the model; training all stages with some parts frozen gives the best yield
They eventually release three models, with Apollo-3B outperforming most 7B models and Apollo-7B outperforming 30B models 🔥
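The FPS-vs-uniform comparison is easy to picture. A minimal sketch of the two frame-sampling strategies (index math only — an illustration, not the authors' implementation):

```python
def fps_sample(num_frames, video_fps, target_fps):
    """FPS sampling: pick frame indices at a fixed temporal rate.

    The number of sampled frames grows with video length."""
    step = video_fps / target_fps
    return [round(i * step) for i in range(int(num_frames / step))]

def uniform_sample(num_frames, num_samples):
    """Uniform sampling: pick a fixed number of frames spread evenly
    over the whole video, regardless of its length."""
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

# A 10 s clip at 30 fps: FPS sampling at 2 fps yields 20 frames,
# 15 frame-indices apart; uniform sampling always yields the 8 we ask for.
print(fps_sample(300, 30, 2))
print(uniform_sample(300, 8))
```

FPS sampling preserves temporal density (motion looks the same regardless of clip length), which is one intuition for why it beats uniform sampling in the paper's evaluation.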
Did a fun experiment: What are the main themes emerging from the 100+ Nieman Journalism Lab predictions for 2025?
I used natural language processing to cluster and map them โ really helps spot patterns that weren't obvious when reading predictions one by one. So what will shape journalism next year? A lot of AI and US politics (surprise!), but there's also this horizontal axis that spans from industry strategies to deep reflections on how to talk to the public.
Click any dot to explore the original prediction. What themes surprise/interest you the most?
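The embed-and-cluster step can be sketched with a toy nearest-seed assignment. A real pipeline would use sentence embeddings plus k-means or UMAP, but the idea is the same; all texts, seeds, and the bag-of-words "embedding" below are illustrative assumptions:

```python
from collections import Counter
import math

def embed(text, vocab):
    """Bag-of-words vector over a fixed vocabulary (a stand-in for a
    real sentence embedding)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def cluster(texts, seeds, vocab):
    """Assign each text to the nearest seed theme by cosine similarity."""
    centers = [embed(s, vocab) for s in seeds]
    return [max(range(len(centers)), key=lambda c: cosine(embed(t, vocab), centers[c]))
            for t in texts]

predictions = [
    "AI will reshape newsroom workflows",
    "Generative AI tools for reporters",
    "Trust in news and how outlets talk to the public",
]
vocab = sorted({w for t in predictions for w in t.lower().split()})
labels = cluster(predictions, ["AI newsroom tools", "trust public"], vocab)
print(labels)  # the two AI predictions group together; the trust one separates
```

Mapping each cluster into 2D (e.g. with UMAP or PCA over the embeddings) is what produces the clickable dot layout.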
🇪🇺 Policy Thoughts on EU AI Act Implementation 🇪🇺
There is a lot to like in the first draft of the EU GPAI Code of Practice, especially as regards transparency requirements. The Systemic Risks section, on the other hand, is concerning both for smaller developers and for external stakeholders.
I wrote more on this topic ahead of the next draft. TLDR: more attention to immediate large-scale risks and to collaborative solutions supported by evidence can help everyone - as long as developers disclose sufficient information about their design choices and deployment contexts.
🌍 Announcing Global-MMLU: an improved, open MMLU dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.
Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.
🏷️ 200+ contributors used Argilla to flag the MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!
Thanks to this annotation process, the open dataset contains two subsets:
1. Culturally Agnostic: no specific regional or cultural knowledge is required.
2. Culturally Sensitive: requires dialect, cultural, or geographic knowledge to answer correctly.
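Splitting the annotated rows into the two subsets is a simple filter. A sketch over toy rows — the `cultural_sensitivity_label` field name and the example questions are assumptions for illustration, not the actual dataset schema:

```python
# Hypothetical annotated rows; field name and labels are illustrative.
rows = [
    {"question": "What is the boiling point of water at sea level?",
     "cultural_sensitivity_label": "CA"},   # Culturally Agnostic
    {"question": "Which US constitutional amendment protects free speech?",
     "cultural_sensitivity_label": "CS"},   # Culturally Sensitive
]

culturally_agnostic = [r for r in rows if r["cultural_sensitivity_label"] == "CA"]
culturally_sensitive = [r for r in rows if r["cultural_sensitivity_label"] == "CS"]
print(len(culturally_agnostic), len(culturally_sensitive))
```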
Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and to professional annotators leveraging Argilla on the Hub.
I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.
📊 Just dropped: a visualization mapping Hugging Face's most liked & downloaded models from 2022 to now. Small models are clearly on the rise - a fascinating shift in both likes and download patterns.
The cleaning process consists of:
- Joining the separate splits together and adding a split column
- Converting string messages into lists of structs
- Removing empty system prompts
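A minimal sketch of those three steps over toy rows — the split names and message format below are illustrative, not the actual dataset schema:

```python
import json

# Toy splits standing in for the dataset's separate splits.
splits = {
    "train": [{"messages": '[{"role": "system", "content": ""},'
                           ' {"role": "user", "content": "Hi"}]'}],
    "test":  [{"messages": '[{"role": "user", "content": "Hello"}]'}],
}

cleaned = []
for split_name, rows in splits.items():
    for row in rows:
        messages = json.loads(row["messages"])  # string -> list of structs
        # Drop system messages whose content is empty.
        messages = [m for m in messages
                    if not (m["role"] == "system" and not m["content"].strip())]
        # Join splits into one table, tagging each row with a split column.
        cleaned.append({"split": split_name, "messages": messages})

print(len(cleaned))  # 2 rows, each tagged with its original split
```

With the Hugging Face `datasets` library, the same steps would typically be done with `concatenate_datasets`, `map`, and `filter` instead of plain loops.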
Fascinating point from @thomwolf at Web Summit: AI misuse (deepfakes, fake news) is actually easier to make with closed models, not with open-source ones.
This challenges the common narrative that open-source AI is inherently more dangerous. The reality is more nuanced - while we may think open source is technically easier to misuse, closed models' accessibility and product-focused design appear to be driving more actual harm.
Important context for current AI safety discussions and regulation debates.
Anthropic publishes the "system prompts" that make Claude tick
- "In its continued effort to paint itself as a more ethical, transparent AI vendor, Anthropic has published the system prompts for its latest models"
- They specify that "Claude cannot open URLs, links, or videos, perform facial recognition or identify or name any humans in photos"
- "Anthropic is exerting pressure on competitors to publish the same. We'll have to see if the gambit works."
https://techcrunch.com/2024/08/26/anthropic-publishes-the-system-prompt-that-makes-claude-tick/
China's tech giants splash out on AI despite US restrictions (paywall)
- "Alibaba, Tencent and Baidu had combined capital expenditure of Rmb50bn ($7bn) in the first half, compared with Rmb23bn a year earlier. TikTok parent ByteDance (which is private) has also increased AI-related spending"
- Nvidia's H100 and upcoming Blackwell series are under US restrictions, but China's tech giants can buy the H20
- Analysts expect Nvidia to ship more than 1mn of the processors to Chinese tech groups in the coming months.
https://www.ft.com/content/31bffc48-2ca7-472b-9d53-3deaad2d86ce
Mark Zuckerberg "said it was improper for the Biden administration to have pressured Facebook to censor content in 2021 related to the coronavirus pandemic"
- "At the time, Facebook's publicly stated goal was to push millions of people toward Covid-19 vaccines. In his letter, Zuckerberg didn't indicate whether he had changed his mind about that goal"
https://www.wsj.com/tech/mark-zuckerberg-neutral-politics-letter-election-2024-02b86372
Just crossed 200,000 free public AI datasets shared by the community on Hugging Face! Text, image, video, audio, time-series & many more... Thanks everyone!
- AI math olympiad winner NuminaMath is here!
- Announcing new Hugging Face and Keras NLP integration
- UI overhaul for HF tokens!
- Embed our dataset viewer on any webpage!
Small models, BIG impact: SmolLM is here! 🚀
We're launching a series of small but mighty language models:
- Super fast - runs on laptops, phones, you name it!
- 3 sizes: 135M, 360M, and 1.7B parameters
- Outperforms same-size models from Meta, Microsoft, and Qwen
- Fully open-source: datasets, training code, models
Key features:
- Trained on FineWeb-Edu and Cosmopedia v2 (the largest synthetic pre-training dataset)
- No cloud needed - run locally for privacy and energy efficiency
- Everything is public, from data curation to training steps
Potential use cases:
- On-device autocomplete
- Local request parsing
- Custom fine-tuning for specific needs without expensive GPUs