Papers
arxiv:2501.04519

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Published on Jan 8
Β· Submitted by lynazhang on Jan 9
#1 Paper of the day
Authors:
,
,
,

Abstract

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data sythesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids na\"ive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.

Community

Paper author Paper submitter

We present rStar-Math to demonstrate that small language models (SLMs, 1.5B-7B) can rival or even surpass the math reasoning capability of OpenAI o1

Β·

Dear Lyna, I've sent you an email on Friday ([email protected]), could you please check it and let me know what you think? Many thanks! Ksenia

holy... shit?

github link isn't working.

Β·
Paper author

As we are still undergoing the internal review process for open-source release, the repository remains private for now. Please stay tuned!

very impressive, I love the simplicity of using Q values as annotations! you mention 64 trajectories as some sort of saturation bound, is that right or have you just not tried scaling this approach even more?

Β·
Paper author

Thank you! On challenging math benchmarks such as AIME, performance nearly saturates with 64 trajectories. For college math, performance continues to improve steadily; however, we did not scale beyond 64 due to the increased search cost. We believe AIME performance can be further improved by synthesizing additional Olympiad-level math problems to improve both the policy model and the process reward model. We leave this as our future work.

Thank you for sharing this work. I appreciate the blend of Monte Carlo Tree Search with smaller models to address step-by-step math reasoning. The idea of generating self-verified solutions rather than relying on a larger teacher model is promising, and it is good to see how you handle the complexity of code-based rollouts. I am curious how this approach might adapt to tasks that involve geometric proofs or more symbolic reasoning. It would also be interesting to learn about the practical limits when problems become highly intricate. Overall, this is a thoughtful piece of research, and I look forward to any future expansions into broader math domains.

Β·
Paper author

Thank you for your comments! We currently have limited experience with tasks involving more symbolic reasoning. However, based on our understanding, the MCTS-based approach can adapt well to such tasks. You might find AlphaGeometry (https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/) and DeepSeek-Prover1.5 (https://arxiv.org/abs/2408.08152) to be valuable references for exploring this direction further.

This is an incredibly impressive paper, and I’m very much looking forward to seeing the open-source code and the detailed development process.

We created a deep-dive video for this paper: https://www.youtube.com/watch?v=cHgHS6Y3QP0
Love to hear your feedback!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Very interesting work! I was curious if there is any section addressing data decontamination. From what I understand, Numina Math may include a notable portion of problems from OlympiadBench and Omni-Math.

Β·
Paper author

Thank you for your question! Decontamination is indeed critical for ensuring unbiased model performance evaluation. We tried our best to address this, including problem matching to identify and remove contaminated training samples from the dataset. For most of our evaluation benchmarks, such as GSM8K, AIME, AMC, CollegeMath and Gaokao, we did not find significant contamination. For MATH, OlympiadBench and Omni-Math, we identified a few hundred potentially contaminated examples and removed them from the training set to maintain the integrity of our evaluations.

This is like self-play to learn Go. It should be able to dramatically improve coding skills too.

very impressive paper, congrats !

This is a very nice work. Is it possible to measure original Qwen models with your PPM?
Could you clarify how trajectory counting works? For instance, I start with 64 trajectories and all of solutions have 10 steps. After each step, I retain only the 32 best paths. Then, I split each of these 32 paths into two, resulting in 64 trajectories again. I repeat this process 10 times (since the solution has 10 steps). In this case, how many trajectories do I have in total? Is it just 64, or is it 64+32Γ—10?

Β·
Paper author

Thank you for your questions! We tested Qwen2.5-72B-Instruct with our PPM. It’s not the math-specialized model as it struggles to effectively follow instructions to generate our code-augmented CoT steps. When combined with our 7B PPM, the 72B general model achieves math performance comparable to Qwen2.5-Math-72B-Instruct + Qwen2.5-RM-72B (MATH500: 85.8 vs. 85.0 (ours), AIME: 36.7 vs. 36.7, Olympiad Bench: 54.5 vs. 55.9). Note that math-specialized models typically outperform general-purpose models.

Very nice work! I have a question regarding the rollouts of the first round with the bootstrap DeepSeek model. You mentioned that you generate do 5 candidate notes and lets assume all produce running python code. Then you do the rollout and backprop of the first node(as all canidates have 0 visits, UCT is infinity) an and get a 1 or -1 for the q. Then you do the same for the other 4 candidate nodes as they have all 0 visits and therefore the highest UCT value.

This means just with the candidates of the very first step, you do 5 of the 8 rollouts and therefore you are not going very deep? Or am I missing something here?

Β·
Paper author

Good question! It will go very deep (currently we set a maximum depth=16). Rollout and backpropagation occur only at the terminal node. For each rollout, we start at the root node (the question) and generate a step-by-step trace until reaching the terminal node (the answer step). At the terminal node, we evaluate whether the answer is correct and backpropagate the reward score to update the Q value of every step node in the trajectory.

The Q value : instead of using rollout-traceback values as state value, you trained a PPM by pair-wise contrastive loss.

The training pair (to train PRM) is obtained by using PRM itself, as well as the final answers correctness to further filter out. Is this understanding correct ?
If correct, then how often do you update the PRM ?
Thanks

Β·
Paper author

Thank you for the question! As illustrated in Fig. 1(b), the training pairs are filtered from the MCTS tree (which is constructed using both the policy model and the PPM). The PRM (i.e., the PPM) is updated only once at the end of each self-evolution round, just like the policy model.

Checkout detailed paper explanation : https://gyanendradas.substack.com/p/rstar-paper-explained

Sharing a video & written summary - https://aipapersacademy.com/rstar-math/

Is there an ETA on the github going public?

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2501.04519 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2501.04519 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2501.04519 in a Space README.md to link it from this page.

Collections including this paper 50