Agustín Piqueres Lajarín

plaguss

AI & ML interests

None yet

Recent Activity

updated a dataset about 3 hours ago
plaguss/docvqa-test
published a dataset about 3 hours ago
plaguss/docvqa-test
updated a dataset about 6 hours ago
plaguss/Llama-3.2-1B-Instruct-dvts-prm-completions

Organizations

Hugging Face, SomosNLP, Hugging Face H4, Argilla, Blog-explorers, Hugging Face TB Research, Argilla Explorers, distilabel-internal-testing, Data Is Better Together, LLHF, SLLHF, Hugging Quants, argilla-internal-testing, Argilla Warehouse, Hugging Face FineVideo, smol-explorers, Hugging Face Science, Data Is Better Together Contributor

plaguss's activity

upvoted an article 1 day ago
upvoted an article 10 days ago

Process Reinforcement through Implicit Rewards

By ganqu
16
reacted to lewtun's post with 🔥 16 days ago
This paper (HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs (2412.18925)) has a really interesting recipe for inducing o1-like behaviour in Llama models:

* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify the correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine, where there are lots of aliases).
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait", etc., that one sees in o1.
* Use the resulting data for SFT & RL.
* Use sparse rewards from GPT-4o to guide RL training. They find RL gives an average ~3 point boost across medical benchmarks, and SFT on this data already gives a strong improvement.
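
To make the data-construction loop in the steps above concrete, here is a rough Python sketch. It is not the paper's code: `Trace`, `build_trace`, `sparse_reward`, the three callables, the `max_attempts` cap, and the 1.0/0.0 reward values are all hypothetical names and choices made for illustration. In practice `sample_cot` would wrap the policy model, and `is_correct` and `merge` would wrap calls to a judge model such as GPT-4o.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trace:
    """Search history for one problem: every attempt plus the merged stream."""
    question: str
    attempts: List[str] = field(default_factory=list)
    solved: bool = False
    merged: Optional[str] = None


def build_trace(
    question: str,
    reference: str,
    sample_cot: Callable[[str, str, List[str]], str],  # (question, strategy, earlier attempts) -> CoT
    is_correct: Callable[[str, str], bool],            # (CoT, reference) -> judge verdict
    merge: Callable[[List[str]], str],                 # attempts -> one reformatted stream
    strategies: List[str],                             # hypothetical search-strategy labels
    max_attempts: int = 4,                             # assumed cap, not taken from the paper
) -> Trace:
    trace = Trace(question)
    for i in range(max_attempts):
        # Cycle through the search strategies so successive attempts differ.
        strategy = strategies[i % len(strategies)]
        cot = sample_cot(question, strategy, list(trace.attempts))
        trace.attempts.append(cot)
        # Stop as soon as the judge accepts an attempt.
        if is_correct(cot, reference):
            trace.solved = True
            break
    # Only verified traces are merged into a single stream for SFT.
    if trace.solved:
        trace.merged = merge(trace.attempts)
    return trace


def sparse_reward(cot: str, reference: str, is_correct: Callable[[str, str], bool]) -> float:
    """Binary outcome reward for RL (the 1.0 / 0.0 values are an assumption)."""
    return 1.0 if is_correct(cot, reference) else 0.0
```

Passing the sampler, judge, and merger in as callables keeps the loop testable with stubs and makes it easy to swap the judge for exact match in domains where that is reliable.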

Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!
liked a Space 19 days ago