David Berenstein

davidberenstein1957

AI & ML interests

Everything data

Recent Activity

updated a dataset 12 minutes ago
smol-blueprint/fineweb-bbc-news-text-embeddings
liked a model 25 minutes ago
minishlab/potion-base-8M
liked a model 31 minutes ago
ibm-granite/granite-embedding-30m-english

Organizations

Hugging Face, SomosNLP, Tools, Webhooks Explorers (BETA), Argilla, Blog-explorers, Argilla Explorers, distilabel-internal-testing, Data Is Better Together, Social Post Explorers, argilla-internal-testing, Dataset Viber, Argilla Warehouse, Dataset Tools, Uplimit, Data Is Better Together Contributor, FeeL (Feedback Loop), Smol Blueprint

davidberenstein1957's activity

posted an update about 20 hours ago
replied to davanstrien's post 5 days ago
Open collaboration is key for democratising AI.

reacted to davanstrien's post with 🤝❤️🚀 5 days ago
The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages reached 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Importantly, this approach can also reduce the amount of data needed for pretraining.
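As a rough illustration of filtering by educational quality, here is a minimal sketch. The record layout (`text`, `edu_score`), the 0-5 score scale, and the threshold of 3 are assumptions made for this example, not the actual schema or settings of fineweb-2 or fineweb-c.

```python
# Hypothetical records: the "text" and "edu_score" field names and the
# 0-5 score scale are assumptions for illustration only.
docs = [
    {"text": "An introduction to photosynthesis for students.", "edu_score": 4},
    {"text": "BUY NOW!!! Limited offer!!!", "edu_score": 0},
    {"text": "A short history of the printing press.", "edu_score": 3},
]


def filter_by_quality(records, min_score):
    """Keep only records whose educational score is at least min_score."""
    return [r for r in records if r["edu_score"] >= min_score]


# The threshold is a tunable assumption: a higher value keeps less data,
# but the remaining data is rated as more educational.
kept = filter_by_quality(docs, min_score=3)
print(f"kept {len(kept)} of {len(docs)} documents")
```

Raising `min_score` trades dataset size for rated quality, which is how this kind of filtering can shrink the amount of data used for pretraining.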

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality of texts in that language well
- Train quality classifiers directly
- Help discover other rules and heuristics for refining fineweb2 further for different languages
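The first point, checking whether an LLM labels educational quality well, could be approached by comparing LLM labels against community annotations on the same texts. The sketch below computes raw agreement and Cohen's kappa in plain Python. The label names and the sample data are hypothetical, and this is one possible evaluation, not necessarily the one the fineweb-c project uses.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four texts (label names are illustrative only).
community = ["good", "bad", "good", "bad"]
llm = ["good", "bad", "bad", "bad"]

agreement = sum(c == m for c, m in zip(community, llm)) / len(community)
kappa = cohen_kappa(community, llm)
print(f"raw agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

Kappa is reported alongside raw agreement because raw agreement can look high when one label dominates the data.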

This week the following languages were completed:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more? https://huggingface.co/blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c
posted an update 11 days ago
posted an update 16 days ago
posted an update 27 days ago
reacted to their post with 🔥 28 days ago
Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: a simple step-by-step process lets anyone create datasets and models in minutes, without writing any code.

Blog: https://huggingface.co/blog/synthetic-data-generator
Space: argilla/synthetic-data-generator
replied to their post 28 days ago

thanks! Hope you can create some cool and useful datasets with it!

reacted to jwlben11's post with 🤗 29 days ago
What is the use of Hugging Face? How can I get up to speed on ML and AI, and on how to use this platform? It would be nice if there were a "get started here" section.
reacted to their post with 🤯🧠❤️👀 30 days ago
posted an update 30 days ago
reacted to julien-c's post with 👀🚀😎 about 1 month ago
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team