How to teach Hugging Face?

non-profit

AI & ML interests

A Tour through the ๐Ÿค— Hub โ€ข Build and Host ML Demos with Gradio & ๐Ÿค— โ€ข Getting Started with Transformers

teach's activity

lewtunย 
posted an update 10 days ago
view post
Post
3222
I was initially pretty sceptical about Meta's Coconut paper [1] because the largest perf gains were reported on toy linguistic problems. However, these results on machine translation are pretty impressive!

https://x.com/casper_hansen_/status/1875872309996855343

Together with the recent PRIME method [2] for scaling RL, reasoning for open models is looking pretty exciting for 2025!

[1] Training Large Language Models to Reason in a Continuous Latent Space (2412.06769)
[2] https://huggingface.co/blog/ganqu/prime
lewtunย 
posted an update 16 days ago
view post
Post
2095
This paper ( HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs (2412.18925)) has a really interesting recipe for inducing o1-like behaviour in Llama models:

* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine where there are lots of aliases)
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait" etc that one sees in o1
* Use the resulting data for SFT & RL
* Use sparse rewards from GPT-4o to guide RL training. They find RL gives an average ~3 point boost across medical benchmarks and SFT on this data already gives a strong improvement.

Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!
  • 1 reply
ยท
lewtunย 
posted an update 30 days ago
view post
Post
6725
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute ๐Ÿ”ฅ

How? By combining step-wise reward models with tree search algorithms :)

We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"

We're open sourcing the full recipe and sharing a detailed blog post.

In our blog post we cover:

๐Ÿ“ˆ Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.

๐ŸŽ„ Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.

๐Ÿงญ Search and Learn: A lightweight toolkit for implementing search strategies with LLMs and built for speed with vLLM

Here's the links:

- Blog post: HuggingFaceH4/blogpost-scaling-test-time-compute

- Code: https://github.com/huggingface/search-and-learn

Enjoy!
  • 2 replies
ยท
christopherย 
posted an update about 1 month ago
view post
Post
1609
The folks at Foursquare released a dataset of 104.5 million places of interest ( foursquare/fsq-os-places) and here's all of them on a plot
ยท
christopherย 
posted an update about 1 month ago
BramVanroyย 
posted an update about 1 month ago
view post
Post
522
In the spirit of "Better late than never", I've finally written a brief overview paper for GEITje 7B Ultra. Initially released 10 months ago (oops), but still reaching around 1300 monthly downloads across the HF ecosystem (not including ollama).

GEITje 7B Ultra: A Conversational Model for Dutch (2412.04092)

While the paper discusses the model a little bit, I especially wanted to write about the datasets, which to this day seem an important asset for Dutch LLM training (SFT and preference tuning). We have a long way to go for Dutch, but publishing transparent and reproducible artefacts seems an important step to me, alongside having open discussions about data, bias, architectures.

In that spirit, thanks are in order for the creation of GEITje 7B Ultra and all related datasets:

- Michiel Buisman and UWV for providing the means to create the datasets
- Flemish Supercomputer Center (VSC) for the compute
- The Hugging Face Fellows and rest of the team for their discussions and insights
- The Dutch NLP community, notably @Rijgersberg for building the base GEITje model and the fruitful discussions we've had

More to come, step by step!

BramVanroy/geitje-7b-ultra-65c1ee010ad80fd1f6a8f208
christopherย 
posted an update 4 months ago
view post
Post
1323
4 million chess puzzles
mrm8488ย 
posted an update 7 months ago
view post
Post
4885
๐ŸšจExciting news for the Multilingual Synthetic Data Community!๐Ÿšจ

Iโ€™ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Hereโ€™s whatโ€™s new!

๐Ÿ—ž The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

๐Ÿค” While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

๐ŸŽ‰ And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

๐Ÿ‘ฉโ€๐Ÿ’ป To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)


๐Ÿ” Explore the datasets ๐Ÿ“š generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)


Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/
ยท
gsartiย 
posted an update 8 months ago
view post
Post
1693
@victor unprompted feature request: I'd love to have a toggle for a HF collection to control whether new items are added to the top or to the bottom. At the moment everything gets added at the bottom, but it would be great to have newer elements on top to make fresh content easily accessible without having to scroll all the way!
  • 3 replies
ยท
BramVanroyย 
posted an update 8 months ago
view post
Post
1762
The InstructGPT paper mentions that they insert 10% pretraining data during SFT, which they find improves the effect of PPO (IIUC). Has anyone else done later ablations on this? I've only seen the inverse suggested, mixing in SFT data during pretraining.
  • 2 replies
ยท
BramVanroyย 
posted an update 8 months ago
view post
Post
2299
All my models seem to be plagued by infinite lists. When you ask a question that requires it to write a list, it most often keeps adding bullet points or enumeration. I am wondering whether this is a result of using chatty GPT-4 as DPO preferences. Any thoughts?
  • 1 reply
ยท
mrm8488ย 
posted an update 8 months ago
view post
Post
5754
Working on a concept GPT-2 (small) that uses KANs instead of MLPs.
The ckpt and training code will be soon on the hub.
ยท
gsartiย 
posted an update 9 months ago
view post
Post
2963
๐Ÿ” Today's (self-serving) pick in Interpretability & Analysis of LMs:

A Primer on the Inner Workings of Transformer-based Language Models
by @javifer @gsarti @arianna-bis and M. R. Costa-jussร 
( @mt-upc , @GroNLP , @facebook )

This primer can serve as a comprehensive introduction to recent advances in interpretability for Transformer-based LMs for a technical audience, employing a unified notation to introduce network modules and present state-of-the-art interpretability methods.

Interpretability methods are presented with detailed formulations and categorized as either localizing the inputs or model components responsible for a particular prediction or decoding information stored in learned representations. Then, various insights on the role of specific model components are summarized alongside recent work using model internals to direct editing and mitigate hallucinations.

Finally, the paper provides a detailed picture of the open-source interpretability tools landscape, supporting the need for open-access models to advance interpretability research.

๐Ÿ“„ Paper: A Primer on the Inner Workings of Transformer-based Language Models (2405.00208)

๐Ÿ” All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9
gsartiย 
posted an update 9 months ago
view post
Post
2452
๐Ÿ” Today's pick in Interpretability & Analysis of LMs: by @aadityasingh T. Moskovitz, F. Hill, S. C. Y. Chan, A. M. Saxe ( @gatsbyunit )

This work proposes a new methodology inspired by optogenetics (dubbed "clamping") to perform targeted ablations during training to estimate the causal effect of specific interventions on mechanism formation.

Authors use this approach to study the formation of induction heads training a 2L attention-only transformer to label examples via context information.

Notable findings:

- The effects of induction heads are additive and redundant, with weaker heads compensating well for the ablation of a strong induction head in case the latter is ablated.
- Competition between induction heads might emerge as a product of optimization pressure to converge faster, but it is not strictly necessary as all heads eventually learn to solve the task.
- Previous token heads (PTH) influence induction heads in a many-to-many fashion, with any PTH eliciting above-chance prediction from a subsequent induction head
- Three subcircuits for induction are identified, respectively mixing token-label information (1 + 2), matching the previous occurrence of the current class in the context (3qk + 4), and copying the label of the matched class (3v + 5).
- The formation of induction heads is slowed down by a larger number of classes & labels, with more classes and more labels slowing down the formation of the matching and copying mechanisms, respectively. This may have implications when selecting a vocabulary size for LLMs: larger vocabularies lead to an increased compression ratio and longer contexts, but they might make copying more challenging by delaying the formation of induction heads.

๐Ÿ’ป Code: https://github.com/aadityasingh/icl-dynamics

๐Ÿ“„ Paper: What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation (2404.07129)

๐Ÿ” All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9
gsartiย 
posted an update 9 months ago
view post
Post
1773
I'm super happy to co-organize the (Mechanistic) Interpretability social at #ICLR2024 with @nikhil07prakash ! ๐Ÿ”

If you plan to attend, help us make this meetup awesome by filling the form below! ๐Ÿ˜„

๐Ÿ“… Wed, May 8, 12:45-2:15 PM
๐Ÿ”— RSVP & share your ideas here: https://forms.gle/FWap4KW2ikdntjfb8
ยท