Process Reinforcement through Implicit Rewards
Ganqu Cui , Lifan Yuan , Zefan Wang , Hanbin Wang , Wendi Li , Bingxiang He , Yuchen Fan , Tianyu Yu , Qixin Xu , Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
: Project lead
: Core contributors
GitHub: https://github.com/PRIME-RL/PRIME
In this blog post, we introduce PRIME (Process Reinforcement through IMplicit REwards), a scalable RL solution for advanced reasoning through implicit process rewards. Our main contributions:
- We present PRIME (Process Reinforcement through IMplicit REwards), an open-source solution for online RL with process rewards, to advance reasoning abilities of language models beyond imitation or distillation.
- With PRIME, starting from Qwen2.5-Math-7B-Base, our trained model Eurus-2-7B-PRIME achieves 26.7% pass@1 on AIME 2024, surpassing GPT-4o and Qwen2.5-Math-7B-Instruct. We achieve this with only 1/10 of the data used by Qwen Math (230K SFT + 150K RL).
- We also explore inference-time scaling and train EurusPRM, a SOTA-level math PRM that pushes the boundary even further.
- Work in Progress. All models and data released. Code coming soon!
Tell me and I forget, teach me and I remember, involve me and I learn.
— Benjamin Franklin
Contents
- Introduction
- Preparation & Imitation Warmup
- Process Reward Models
- Reinforcement Learning
- Experiments
- Inference Scaling with Implicit PRM
- Appendix
Introduction
Figure: Our Eurus-2-7B-PRIME excels at competition-level mathematics benchmarks, outperforming advanced math models and larger models. Notably, PRIME brings substantial performance gain (+16.7%) for Eurus-2-7B-SFT.
While the advanced reasoning of large language models (LLMs) can be improved through data-driven imitation, this approach creates fundamental scalability barriers: better reasoning requires exponentially more high-quality examples to imitate, making continuous improvement increasingly intractable. We believe the key to overcoming such challenges lies in transforming data-driven approaches into exploration-based methods, as exemplified by reinforcement learning (RL). To this end, two critical challenges need to be addressed: (1) how to obtain precise reward signals efficiently and scalably, especially dense ones? (2) how can we build effective RL algorithms to fully unleash the potential of these signals?
In this blog, we seek the scalable path towards advanced reasoning capabilities with efficient reward modeling and reinforcement learning.
Our recent study presented the implicit process reward modeling (PRM) objective. Without the need for any process label, implicit PRM is trained as an outcome reward model (ORM) and then used as a PRM. Inspired by this captivating property, we find that besides improving model performance through inference scaling, the true power of the implicit PRM is unveiled in online RL training. Specifically, it brings three benefits to RL:
- Dense Reward: Implicit PRM directly learns a Q-function that provides rewards for each token, which alleviates the reward sparsity issue without the need of an extra value model.
- Scalability: Implicit PRM can be online updated with only outcome label. Therefore, we can directly update the PRM with on-policy rollouts given outcome verifiers, which mitigates the distribution shift as well as scalability issues for PRMs.
- Simplicity: Implicit PRM is inherently a language model. In practice, we show that it is unnecessary to train a PRM beforehand, since the SFT model itself already serves as a strong starting point.
We then dive into RL to figure out its key algorithm designs and implementation techniques. To this end, we present Process Reinforcement through IMplicit rEwards, PRIME, which effectively incorporates and updates PRMs in RL.
As an intermediate result, PRIME achieves substantial improvements on key reasoning benchmarks over our SFT version of the model, with a 16.7% improvement on average and over 20% on the AMC and AIME competitions. Our final model Eurus-2-7B-PRIME, based on Qwen-2.5-Math-7B-Base, surpasses its instruct version on 5 key reasoning benchmarks. We then train a PRM with the implicit PRM objective for inference-time scaling, which further boosts the model's reasoning capability.
The evaluation results of the opening figure are detailed below:
Benchmark | Eurus-2-7B-PRIME | Eurus-2-7B-SFT | Qwen-2.5-Math-7B-Instruct | Llama-3.1-70B-Instruct | GPT-4o |
---|---|---|---|---|---|
AIME 2024 | 26.7 (+23.3) | 3.3 | 13.3 | 16.7 | 9.3 |
MATH-500 | 79.2 (+14.1) | 65.1 | 79.8 | 64.6 | 76.4 |
AMC | 57.8 (+27.7) | 30.1 | 50.6 | 30.1 | 45.8 |
Minerva Math | 38.6 (+5.9) | 32.7 | 34.6 | 35.3 | 36.8 |
OlympiadBench | 42.1 (+12.3) | 29.8 | 40.7 | 31.9 | 43.3 |
Avg. | 48.9 (+16.7) | 32.2 | 43.8 | 35.7 | 43.3 |
We achieve this with only 1/10 data resources compared with Qwen-Math. The following is a comparison of resource requirements between Eurus-2-7B-PRIME and Qwen2.5-Math-7B-Instruct.
Resource | Eurus-2-7B-PRIME | Qwen2.5-Math-7B-Instruct |
---|---|---|
Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
SFT Data | 230K (open-source) | 2.5M (open-source and in-house) |
RM Data | 0 | 618K (in-house) |
RM | Eurus-2-7B-SFT | Qwen2.5-Math-RM (72B) |
RL Data | 150K queries × 4 samples | 66K queries × 32 samples |
This blog will introduce:
- The implicit process reward modeling objective and why it’s advantageous for PRM&RL
- The PRIME algorithm which incorporates implicit process reward into online RL
- The full recipe to build a strong reasoning model Eurus-2-7B-PRIME
- How we further enhanced its performance by inference-time scaling with EurusPRM
We release all the models and data used in this research.
Preparation and Imitation Warmup
Models and Evaluation Datasets
We select Qwen2.5-Math-7B-Base as the starting point for its great mathematical capabilities.
For evaluation, we primarily adopt competition-level mathematics and programming benchmarks, as well as several commonly used datasets, including AIME 2024, AMC, MATH-500, Minerva Math, OlympiadBench, LeetCode and LiveCodeBench(v2).
Imitation Learning
We first performed supervised finetuning on the base model to get a starter model for RL.
Action-centric chain-of-thought reasoning
We applied imitation learning (supervised finetuning) as a warmup stage to teach models to learn certain reasoning patterns. To this end, we first designed an action-centric chain-of-thought reasoning framework, where the policy model chooses one of 7 actions at each step and stops after executing each action.
SFT dataset construction
To construct the SFT dataset, we collected reasoning instructions from several open-source datasets. It is noteworthy that we did not include many datasets with ground-truth answers in SFT even though they are of higher quality, but reserved them for the later RL training. The reason is that we aim to use different datasets for SFT and RL to diversify the exploration in RL, and we consider ground-truth more essential in RL than in SFT. For completion, we employ LLaMA-3.1-70B-Instruct to answer the instructions, with a system prompt requesting the model to perform action-centric chain-of-thought.
We finally obtained 230K SFT examples. The detailed sources and statistics can be found in the Appendix.
SFT results
The performance of our SFT model after finetuning is reported in the opening figure.
Our SFT model lags behind Qwen2.5-Math-7B-Instruct on all mathematics benchmarks.
Process Reward Models
Implicit PRM: Free Process Rewards without Process Labels
We adopt Implicit PRM, which obtains process rewards at no additional cost: one simply trains an ORM on the cheaper response-level labels. During inference, implicit process rewards are obtained with a forward pass by calculating the log-likelihood ratio at each step.
The key ingredient of Implicit PRM is the reward representation, as demonstrated below:
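Proposition: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_i \mid \mathbf{y}_{<i})}{\pi_\text{ref}(y_i \mid \mathbf{y}_{<i})}$. Then $q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:
$$q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y} \mid \mathbf{y}_{\le t})} e^{\frac{1}{\beta} r_\phi(\mathbf{y})}$$
Hence, $q_\phi^t$ represents an exact expectation of outcome reward $r_\phi$ at step $t$, i.e., the Q value.
The proposition indicates that when modeling $r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$ to train an ORM with the standard pipeline, where $\beta$ is a hyperparameter, $\phi$ can implicitly learn a Q function. Hence, the process reward $r_\phi^t$ can be obtained by:
$$r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_t \mid \mathbf{y}_{<t})}{\pi_\text{ref}(y_t \mid \mathbf{y}_{<t})}$$
Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
The proposition is agnostic to the specific choice of ORM training objective. It can be instantiated with the same objectives as vanilla ORM training, the only difference being that $r_\phi(\mathbf{y})$ is substituted with $\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. For example, DPO already meets our assumption and serves as a strong variant, while in this work we instantiate our implicit PRM with the cross-entropy (CE) loss for memory efficiency:
$$\mathcal{L}_{CE} = -\left[ l \log \sigma\left(\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}\right) + (1 - l) \log\left(1 - \sigma\left(\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}\right)\right) \right]$$
where $l$ is the response-level outcome label.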
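To make the reward representation concrete, here is a minimal sketch in plain Python. It assumes the per-token log-probabilities of a response under the PRM and under the reference model have already been computed; the function names and the `beta` value in the demo are illustrative choices, not part of the released code.

```python
import math


def implicit_process_rewards(policy_logps, ref_logps, beta):
    """Token-level rewards: beta * (log pi_phi(y_t | y_<t) - log pi_ref(y_t | y_<t))."""
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


def q_values(process_rewards):
    """Running sums of the token rewards, i.e. the q values at each step."""
    out, total = [], 0.0
    for reward in process_rewards:
        total += reward
        out.append(total)
    return out


def ce_loss(policy_logps, ref_logps, label, beta):
    """Cross-entropy loss on the response-level reward (sum of token rewards).

    label is the outcome label: 1 for a correct response, 0 otherwise.
    """
    reward = sum(implicit_process_rewards(policy_logps, ref_logps, beta))
    prob = 1.0 / (1.0 + math.exp(-reward))
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


if __name__ == "__main__":
    # Hypothetical log-probabilities for a three-token response.
    policy = [-1.0, -0.5, -2.0]
    ref = [-1.5, -0.5, -1.0]
    rewards = implicit_process_rewards(policy, ref, beta=0.5)
    print(rewards)            # per-token process rewards
    print(q_values(rewards))  # the last entry equals the outcome reward
    print(ce_loss(policy, ref, label=1, beta=0.5))
```

Only the response-level label enters the loss, yet the same log-ratio terms yield a reward for every token, which is the property the proposition describes.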
Reinforcement Learning
Our goal is clear and focused: to extensively leverage reinforcement learning (RL) to enhance reasoning capabilities. Aiming at the best practices of such a paradigm with limited resources, our key insights can be summarized below:
- Start from high-quality data with ground truth verifiers: We did rigorous data collection and cleaning to obtain verifiable RL data, and found that using outcome verifiers alone already yields a strong baseline.
- Simple REINFORCE-like algorithms are surprisingly effective: We compared different RL algorithms and concluded that value-model-free REINFORCE-like methods are powerful enough.
- Use “mid-difficulty” problems for stabilized training: We proposed a mechanism named online prompt filter, which largely stabilized RL training by filtering out overly difficult and overly simple questions.
- Implicit process rewards push the boundary even further! We successfully integrated process rewards into online RL, and observed great training acceleration and performance improvement. The method is seamlessly accessible to everyone.
Pilot Study on Algorithms and Data
RL Data Collection & Preprocessing
We curated a high-quality RL training dataset of mathematics and coding problems with outcome verifiers (LaTeX answers for math and test cases for coding).
- For math, we sourced from NuminaMath-CoT, which contains about 860K math problems. The problems span from Chinese high school mathematics to International Mathematical Olympiad competition questions.
- For coding, we sourced from APPS, CodeContests, TACO, and Codeforces.
To further increase data quality, we conducted detailed cleaning and filtering. Detailed data preprocessing can be found in Appendix. Finally, we retain 457k math problems and 27k coding problems.
Online Prompt Filtering
During the rollout stage, we find that choosing appropriate prompts matters a lot, in particular preserving only the prompts within a certain difficulty range. Inspired by Qwen-2.5-Math, which filtered prompts beforehand according to the accuracy of the initial policy model, we perform online prompt filtering throughout training. We sample multiple trajectories for each prompt, then calculate the accuracy and preserve the prompts whose accuracy falls within a certain range. This also balances the training data distribution for the PRM update.
We conducted experiments validating this prompt filtering strategy. We sampled 4 trajectories for each prompt and set the accuracy range to [0.2, 0.8], which means we discard prompts that are either too easy or too hard. We plot the training rewards in the figure below.
From the results, we can see that online prompt filter largely lowers the variance of RL training.
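The filter itself is simple to express. The sketch below assumes binary outcome labels per rollout and uses the 0.2 to 0.8 accuracy range that this post gives for PRIME; the function name and the data layout (a dict from prompt to a list of 0/1 outcomes) are hypothetical.

```python
def filter_prompts(outcomes_by_prompt, low=0.2, high=0.8):
    """Keep prompts whose rollout accuracy lies in [low, high].

    outcomes_by_prompt maps a prompt to the 0/1 verifier outcomes of its rollouts.
    Returns a dict from each kept prompt to its accuracy.
    """
    kept = {}
    for prompt, outcomes in outcomes_by_prompt.items():
        accuracy = sum(outcomes) / len(outcomes)
        if low <= accuracy <= high:
            kept[prompt] = accuracy
    return kept


if __name__ == "__main__":
    rollouts = {
        "too_easy": [1, 1, 1, 1],
        "too_hard": [0, 0, 0, 0],
        "mid_a": [1, 0, 0, 1],
        "mid_b": [1, 0, 0, 0],
    }
    print(filter_prompts(rollouts))
```

With 4 samples per prompt the accuracy can only be 0, 0.25, 0.5, 0.75 or 1, so this range amounts to dropping prompts that are solved by all rollouts or by none.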
RL Algorithms
We compared different online RL algorithms including PPO, REINFORCE, RLOO, GRPO, and ReMax. We implemented them with verl and conducted pilot experiments with outcome verifiers as rewards. Specifically, the ground-truth outcome rewards come from the outcome verifiers described above: the reference LaTeX answers for math and the test cases for coding.
For these preliminary experiments, we began training with a fine-tuned Llama-3.1-8B model and report the results in the Appendix. We find that REINFORCE-like algorithms, despite being simpler than PPO, are strong enough to produce stable results. We choose the best-performing RLOO as our RL algorithm. Note that we only adopt the advantage/return estimation function of RLOO, and use the PPO policy loss with importance sampling and value clipping for training stability.
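For readers unfamiliar with RLOO, the advantage of each sampled response is its reward minus the average reward of the other samples for the same prompt. A minimal sketch (the function name is ours, and this is not the verl implementation):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for K responses sampled from one prompt.

    Each response is compared against the mean reward of the other K - 1 responses.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one prompt, two of them judged correct by the verifier.
    print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline for a response never includes that response's own reward, no separate value model is needed to reduce variance.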
PRIME: Reinforcement Learning with PRM
Integrating PRMs into (online) reinforcement learning is not trivial, and poses several critical challenges to solve. Here we present the key challenges and how we solved them with Implicit PRM.
🤔How to provide dense rewards to reinforcement learning?
Reward sparsity has been a long-standing problem in RL, including RL for LLMs. So far, there is no widely accepted way to compose dense rewards in (online) RL for LLMs. Previous approaches mainly set up an additional value model for dense rewards, which is known to be hard to train and brings little performance gain. Therefore, it is unclear how we can incorporate process rewards into RL practice.
💡We seamlessly utilize process rewards for every token in advantage/return estimation.
Under our reward modeling objective $r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$, we can obtain token-level process rewards from implicit PRMs for free. In this way, our PRM can directly replace the value model in PPO, making it extremely easy to combine with any advantage estimation function and with outcome rewards. In practice, we integrated process rewards with REINFORCE, RLOO, GRPO, ReMax, and PPO with minor modifications.
🤔How to set up a good PRM to start RL?
Even if we find a path to use process rewards in RL, training good PRMs to start with is also non-trivial. Practitioners need to collect large-scale (process) reward data which is expensive and the model should achieve a good balance between generalization and distribution shift.
💡Start with your policy model as PRM.
Implicit PRM is inherently a language model. So theoretically, you can use any language model as the PRM. In practice, we find that the starting policy model itself serves as a great (if not the best) initialization of the PRM. That means you only need one model to start your RL journey! This makes RL with implicit PRMs more accessible than ever before.
🤔How to update PRM online to prevent reward hacking?
In online RL, it is crucial that the RM is not overoptimized or hacked, which requires the RM to keep updating along with the policy model. However, given the cost of step labels, it is difficult to update PRMs during RL training. This raises considerable scalability and generalization concerns for PRMs in RL.
💡Implicit PRMs only demand outcome labels to update.
That is to say, with outcome verifiers, we can easily update our PRMs during training! In experiments, we illustrate the importance of online PRM. Moreover, we can also do double-forward, where we first update the PRM with on-policy rollouts, then re-calculate the process rewards with the updated PRM, and thus provide an even more accurate reward estimation.
PRIME Algorithm
We describe our final algorithm in this section. First, we illustrate the full cycle of PRIME with an animation.
The policy model and PRM are both initialized with the SFT model. For each RL iteration, the policy model first generates rollouts. Then, the implicit PRM and outcome verifier score the rollouts, and the implicit PRM gets updated on the rollouts with the outcome reward. Finally, the outcome reward and process reward are combined and used to update the policy model.
Implementation
We present pseudo code here:
The algorithm flow includes:
- Prompt filtering based on policy model performance, only preserving prompts on which the policy model achieves an accuracy between 0.2 and 0.8.
- Calculate the implicit process reward $r_\phi^t$.
- Update the implicit PRM based on the predicted implicit process reward $r_\phi^t$ and the ground-truth outcome label $l$.
- Advantage estimation with RLOO. Specifically, we first calculate the return of outcome rewards and implicit process rewards separately:
  - For ground-truth outcome rewards, we directly adopt RLOO without any modification.
  - For implicit process rewards, we perform a three-step process to calculate the return: (1) use the averaged implicit process rewards to calculate the leave-one-out baseline; (2) normalize the process reward at step $t$ by subtracting the baseline; (3) calculate the discounted return for each response.
  - Finally, the advantage is set to the combination of both returns.
- Update the policy using the PPO loss, for proper importance sampling.
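The advantage estimation step can be sketched as follows. This is one reading of the description above, not the released implementation: we assume that "averaged" means the per-response mean of token rewards, that "combination" means adding the outcome advantage to every token's process return, and that `gamma` (a hypothetical parameter name) is the discount factor.

```python
def loo_baselines(values):
    """For each entry, the mean of all other entries (leave-one-out)."""
    k = len(values)
    total = sum(values)
    return [(total - v) / (k - 1) for v in values]


def discounted_returns(rewards, gamma):
    """Return-to-go: G_t = sum over s >= t of gamma^(s - t) * r_s."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def prime_advantages(process_rewards, outcome_rewards, gamma=1.0):
    """Token-level advantages for K responses to one prompt (K >= 2).

    process_rewards: K lists of token-level implicit process rewards.
    outcome_rewards: K scalar verifier rewards.
    """
    # Outcome rewards: plain RLOO, one scalar advantage per response.
    outcome_adv = [r - b for r, b in zip(outcome_rewards, loo_baselines(outcome_rewards))]
    # Process rewards: (1) leave-one-out baseline from per-response averages.
    mean_process = [sum(tokens) / len(tokens) for tokens in process_rewards]
    process_base = loo_baselines(mean_process)
    advantages = []
    for tokens, base, o_adv in zip(process_rewards, process_base, outcome_adv):
        centered = [r - base for r in tokens]           # (2) subtract the baseline
        returns = discounted_returns(centered, gamma)   # (3) discounted return
        advantages.append([g + o_adv for g in returns])  # combine both returns
    return advantages


if __name__ == "__main__":
    print(prime_advantages([[0.5, 0.5], [-0.5, -0.5]], [1.0, 0.0]))
```

The resulting per-token advantages would then feed the PPO policy loss in place of advantages computed from a learned value model.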
Experiments
Settings
By default, we initialize the implicit PRM with SFT model and retain the SFT model for reference logprobs. For hyperparameters, we use a constant 5e-7 learning rate together with AdamW optimizer for policy model, and use 1e-6 learning rate for PRM. Both policy and PRM use a mini batchsize of 256 and micro batchsize of 8. The rollout stage collects 256 prompts and samples 4 responses for each prompt. We set for PRM training. We set KL coefficient to 0 in all experiments.
Main Results
We first present the effect of dense rewards in reinforcement learning. Here we compare PRIME with RLOO w/ outcome verifier (OV) only, which means there are only ground truth outcome rewards for each trajectory. We trained this model for 240 steps. For PRIME, we use the same setting and trained the model for 592 steps. We plot the training rewards measured by the outcome verifier and the test accuracy in the following figures. Compared with sparse reward, PRIME accelerates RL training by 2.5× and improves the final rewards by 6.9%, with lower variance. On downstream tasks, PRIME also consistently outperforms the OV-only setup.
Figure: Training outcome rewards. For fair comparison, we cut the training steps at 240.
Figure: Test accuracy comparison.
We list detailed results below. At the same step 240, the model trained by PRIME is generally better than the model trained with outcome rewards only, with a gap of about 4 points on average (41.0 vs. 36.9). PRIME further improves the model with more training steps.
Method | Step | AIME 2024 | AMC | MATH-500 | Minerva Math | OlympiadBench | LeetCode | LiveCodeBench | Math Avg. | Avg. |
---|---|---|---|---|---|---|---|---|---|---|
Eurus-2-7B-SFT | 0 | 3.3 | 30.1 | 66.2 | 32.7 | 29.8 | 21.7 | 17.8 | 32.2 | 28.8 |
RLOO w/ OV Only | 240 | 20.0 | 47.0 | 73.2 | 36.4 | 35.4 | 28.3 | 26.7 | 42.2 | 36.9 |
PRIME | 80 | 20.0 | 41.0 | 68.2 | 38.2 | 37.0 | 26.7 | 26.6 | 40.9 | 36.8 |
PRIME | 160 | 13.3 | 42.2 | 72.0 | 37.1 | 38.7 | 26.7 | 25.6 | 40.7 | 36.5 |
PRIME | 240 | 20.0 | 50.6 | 78.2 | 39.3 | 40.3 | 31.1 | 27.5 | 45.7 | 41.0 |
PRIME | 320 | 16.7 | 51.8 | 77.8 | 39.7 | 41.5 | 36.1 | 28.5 | 45.5 | 41.7 |
PRIME | 592 | 26.7 | 57.8 | 79.2 | 38.6 | 42.1 | 33.3 | 28.6 | 48.9 | 43.9 |
Effect of Online PRM
We introduced online PRM, which updates with policy model rollouts and their corresponding verifier outcomes. Here we demonstrate the importance of online updates for PRMs. We compare two settings, where the online PRM is initialized from Eurus-2-7B-SFT and the offline PRM is EurusPRM-Stage1. From the figures below, we can see that the online PRM outperforms the offline PRM by a large margin on both training and test sets.
Effect of Reference Policy
We implement two variants of our algorithm to explore the effect of the reference policy, one using the initial SFT model as the reference model and the other using the running policy's old logprobs as the reference, as shown in the figures below. The first (policy ref) simply adopts the old logprob of the policy model as $\pi_\text{ref}$, while the second (SFT ref) retains the initial SFT model for an additional $\pi_\text{ref}$ calculation. We compare their performance in this section.
Figure: Policy ref. We discard the reference policy and use the old logprob as $\pi_\text{ref}$ for the PRM.
Figure: SFT ref. We retain the initial policy to provide $\pi_\text{ref}$ for the PRM and KL.
Step | SFT Ref | Policy Ref |
---|---|---|
80 | 36.8 | 36.7 |
160 | 36.5 | 38.4 |
240 | 41.0 | 40.5 |
320 | 41.7 | 41.0 |
From the training rewards and test accuracy, we find the two strategies are close, and they have pros and cons in different aspects: Policy ref only needs two models in RL training, while SFT ref requires one more reference model. On the other hand, KL divergence calculation is only allowed when the initial SFT model is retained.
Single-Forward v.s. Double-Forward
Since our implicit PRM is concurrently updated in training, for each rollout stage, we can update PRM before policy model and use the updated PRM to re-calculate the process rewards, which we call the double-forward setting. We investigate the impact of double-forward in both training and test phase. Our default setting applies single-forward, which uses process rewards from old PRMs. We plot PRM accuracy on rollouts and training rewards below.
Accordingly, we find that double-forward could increase PRM accuracy, but the training rewards remain close between the two methods.
We also compare the average test set accuracy of single-forward and double-forward. Their performances are also close. Since double-forward brings more computation overhead, we recommend the single-forward setting in practice.
Step | Single-Forward | Double-Forward |
---|---|---|
80 | 36.8 | 35.7 |
160 | 36.5 | 37.4 |
240 | 41.0 | 40.4 |
320 | 41.7 | 41.0 |
Inference Scaling with Implicit PRM
Beyond RL, implicit PRM can further scale inference-time computation through Best-of-N sampling. In this section, we present EurusPRM, a SOTA-level open-source PRM for Best-of-N sampling.
PRM Training
We introduce a two-stage training pipeline upon Qwen2.5-Math-7B-Instruct for EurusPRM. We collected instructions with ground truth and employ Qwen2.5-Math-7B-Base, Llama-3.1-8B-Base/Instruct, Llama-3.1-70B-Instruct, Qwen2.5-72B-Instruct, and our SFT model to sample rollouts. Training datasets statistics can be found in Appendix.
Stage 1: Training on Complete Response-level Rollouts
We applied the CE loss $\mathcal{L}_{CE}$ above to train the implicit PRM. We used a learning rate of 5e-7 and a batch size of 64 for training.
Stage 2: Training on Manufactured Partial Step-level Pairs
We started the second-stage training on top of the first-stage model with fine-grained step-level labels. To obtain step-level labels, we employed Llama-3.1-70B-Inst and Qwen2.5-72B-Inst to insert subtle errors into correct solutions. We also mixed response-level data in this stage. The model was continually trained with the CE loss $\mathcal{L}_{CE}$, with a learning rate of 5e-7 and a batch size of 64.
PRM Evaluation
Evaluation Base Model
We adopt Eurus-2-7B-SFT, Qwen2.5-7B-Instruct and Llama-3.1-70B-Instruct as generation models to evaluate the performance of our implicit PRM. For all models, we set the sampling temperature as 0.5, p of the top-p sampling as 1.
Best-of-N Sampling
We use Best-of-64 as our evaluation metric. The weighting methods are different for several PRMs below.
- For Skywork-o1-Open-PRM-Qwen-2.5-7B, we use simple average reward across all steps.
- For EurusPRM-Stage 1, we use the minimum reward across all steps.
- For EurusPRM-Stage 2, we use the accumulative rewards.
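The three aggregation rules above, and the two selection schemes in the result tables, can be sketched as follows. The post does not define weighted Best-of-N, so the version here assumes a common definition: sum the response-level scores of all responses sharing a final answer and pick the answer with the largest total. All function names and the demo scores are hypothetical.

```python
from collections import defaultdict


def aggregate(step_rewards, how):
    """Collapse step-level rewards into one response-level score."""
    if how == "mean":
        return sum(step_rewards) / len(step_rewards)
    if how == "min":
        return min(step_rewards)
    if how == "sum":
        return sum(step_rewards)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates, how):
    """candidates: list of (final_answer, step_rewards). Pick the top-scoring response."""
    return max(candidates, key=lambda c: aggregate(c[1], how))[0]


def weighted_best_of_n(candidates, how):
    """Assumed definition: total the scores per distinct answer, pick the largest total."""
    totals = defaultdict(float)
    for answer, step_rewards in candidates:
        totals[answer] += aggregate(step_rewards, how)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    samples = [("42", [0.9, 0.2]), ("7", [0.5, 0.5]), ("7", [0.4, 0.4])]
    for how in ("mean", "min", "sum"):
        print(how, best_of_n(samples, how), weighted_best_of_n(samples, how))
```

The demo shows why the choice matters: the same three samples select different answers depending on whether a single weak step (minimum) or the overall total (sum) dominates the score.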
Eurus-2-7B-SFT
Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg |
---|---|---|---|---|---|---|---|
Greedy Pass @ 1 | N/A | 65.1 | 30.1 | 3.3 | 29.8 | 32.7 | 32.2 |
Majority Voting @ 64 | N/A | 65.6 | 53.0 | 13.3 | 39.1 | 22.4 | 38.7 |
Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 47.2 | 45.8 | 10.0 | 32.3 | 16.2 | 30.3 |
Best-of-64 | EurusPRM-Stage 1 | 44.6 | 41.0 | 6.7 | 32.9 | 17.3 | 28.5 |
Best-of-64 | EurusPRM-Stage 2 | 47.2 | 43.4 | 13.3 | 33.8 | 19.2 | 31.4 |
Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 64.6 | 55.4 | 13.3 | 41.3 | 23.2 | 39.6 |
Weighted Best-of-64 | EurusPRM-Stage 1 | 66.0 | 54.2 | 13.3 | 39.6 | 29.0 | 40.4 |
Weighted Best-of-64 | EurusPRM-Stage 2 | 66.0 | 54.2 | 13.3 | 39.7 | 29.0 | 40.4 |
Llama-3.1-70B-Instruct
Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg |
---|---|---|---|---|---|---|---|
Greedy Pass @ 1 | N/A | 64.6 | 30.1 | 16.7 | 31.9 | 35.3 | 35.7 |
Majority Voting @ 64 | N/A | 80.2 | 53.0 | 26.7 | 40.4 | 38.6 | 47.8 |
Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 77.8 | 56.6 | 23.3 | 39.0 | 31.6 | 45.7 |
Best-of-64 | EurusPRM-Stage 1 | 77.8 | 44.6 | 26.7 | 35.3 | 41.5 | 45.2 |
Best-of-64 | EurusPRM-Stage 2 | 80.6 | 59.0 | 20.0 | 37.6 | 44.9 | 48.4 |
Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 81.2 | 56.6 | 23.3 | 42.4 | 38.2 | 48.3 |
Weighted Best-of-64 | EurusPRM-Stage 1 | 80.4 | 53.0 | 26.7 | 40.9 | 46.7 | 49.5 |
Weighted Best-of-64 | EurusPRM-Stage 2 | 80.4 | 53.0 | 26.7 | 41.0 | 46.3 | 49.5 |
Qwen2.5-7B-Instruct
Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg |
---|---|---|---|---|---|---|---|
Greedy Pass @ 1 | N/A | 73.3 | 47.0 | 13.3 | 39.4 | 35.3 | 41.7 |
Majority Voting @ 64 | N/A | 82.0 | 53.0 | 16.7 | 43.0 | 36.4 | 46.2 |
Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 85.2 | 60.2 | 20.0 | 44.7 | 32.7 | 48.6 |
Best-of-64 | EurusPRM-Stage 1 | 81.8 | 47.0 | 16.7 | 40.1 | 41.5 | 45.4 |
Best-of-64 | EurusPRM-Stage 2 | 86.0 | 59.0 | 16.7 | 41.4 | 41.5 | 48.9 |
Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 83.6 | 55.4 | 13.3 | 43.7 | 36.8 | 46.6 |
Weighted Best-of-64 | EurusPRM-Stage 1 | 82.6 | 53.0 | 16.7 | 42.7 | 45.2 | 48.0 |
Weighted Best-of-64 | EurusPRM-Stage 2 | 84.8 | 53.0 | 16.7 | 43.2 | 45.6 | 48.7 |
Appendix
SFT Data and Training Details
The SFT data statistics are as follows:
Task | Dataset | Size | Avg. Response Length | Source |
---|---|---|---|---|
Math | MathInstruct-MATH | 12715 | 964.01 | https://huggingface.co/datasets/TIGER-Lab/MathInstruct |
Math | OpenMathInstruct-2-Augmented_Math | 15086 | 1202.25 | https://huggingface.co/datasets/nvidia/OpenMathInstruct-2 |
Math | Numina | 55845 | 1331.61 | https://huggingface.co/datasets/AI-MO/NuminaMath-CoT |
Math | reasoning-001 | 29831 | 1316.49 | https://huggingface.co/datasets/SkunkworksAI/reasoning-0.01 |
Coding | Code-Feedback | 27663 | 1805.16 | https://huggingface.co/datasets/m-a-p/Code-Feedback |
Coding | Magicoder | 24480 | 1828.72 | https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K |
Coding | Magicoder-OSS | 28980 | 1850.05 | https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K |
Biomedicine | UltraMedical_mc | 35163 | 891.06 | https://huggingface.co/datasets/TsinghuaC3I/UltraMedical |
Total / Avg. | - | 229763 | 1390.75 | - |
Training Details
The following hyperparameters were used during training:
Parameter | Value |
---|---|
Fine-tuning Type | Full |
Data Max Length | 6144 |
Learning Rate | 1e-05 |
GPU Batch Size | 2 |
Seed | 42 |
Gradient Accumulation | 2 |
Train Batch Size | 96 |
Optimizer | OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 |
LR Schedule | Cosine |
Warmup Ratio | 0.1 |
Epochs | 3 |
RL Data Preprocessing
Data Filtering and Question-Type Classification
The preprocessing pipeline employs a systematic rule-based approach to filter and classify mathematical problems to create a high-quality dataset with solvable problems, appropriate difficulty levels, and correct solutions.
We exclude problems containing figures or diagrams since they require visual processing capabilities. We also remove proof questions due to difficulties in answer verification. The remaining problems are classified into question-answering, multiple-choice, or fill-in-the-blank questions based on specific patterns. Since fill-in-the-blank questions comprise fewer than 400 examples compared to the much larger set of multiple-choice questions, we focus solely on multiple-choice questions for further processing.
Converting to Direct Question-Answer Format
We transform multiple-choice questions into a direct question-answer format through three sequential stages: rule-based filtering, LLM-based filtering, and LLM-based formatting.
We first identify and remove questions that inherently require multiple-choice options - specifically, those where comparing specific statements or properties is essential to the problem-solving process. These questions cannot be meaningfully converted to a direct question-answer format. The initial filtering employs simple rule-based pattern matching, searching for keywords like "following" and "statement" that typically indicate option-dependent problems.
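A rule of this kind can be as small as the sketch below. Only the two keywords come from the text; the function name, the prefix matching (so that "statements" also matches) and the example questions are our own assumptions.

```python
import re

# Keywords named in the text as typical signs of option-dependent questions.
OPTION_KEYWORDS = ("following", "statement")


def requires_options(question):
    """Return True if the question text contains a keyword hinting at option dependence."""
    words = re.findall(r"[a-z]+", question.lower())
    return any(word.startswith(key) for word in words for key in OPTION_KEYWORDS)


if __name__ == "__main__":
    print(requires_options("Which of the following statements is true?"))
    print(requires_options("What is the value of 2 + 3? A. 4 B. 5"))
```

A keyword rule like this is deliberately coarse; it only removes the obvious cases before the LLM-based stages.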
Following the rule-based filtering, we employ Llama-3.1-8B-Instruct to perform a more nuanced classification of the remaining questions. Our pilot study revealed that while the LLM occasionally misclassifies questions, it tends to err on the conservative side - marking potentially convertible questions as requiring options rather than the reverse. Given our large dataset, we accepted this conservative approach to maintain quality.
For questions classified as convertible, we implement a two-phase reformatting process:
- Question Reformatting: Removing choice indicators and restructuring the question to elicit direct answers
- Solution Reformatting: Converting multiple-choice solutions into step-by-step derivations, ensuring all final answers are presented in standard LaTeX boxed format
This systematic approach maintains mathematical rigor while creating a standardized format suitable for downstream applications.
Problem and Solution Validation
The final stage involves merging all question-answer pairs and performing LLM-based comprehensive validation. We identify two key aspects in validation: solvability and correctness.
We leverage state-of-the-art mathematical reasoning models, including QwQ-32B-Preview and Qwen2.5-Math-72B-Instruct, employing a self-consistency approach to determine problem solvability, and if solvable, verify the correctness of solutions provided in the original dataset.
To enhance validation accuracy, we first analyzed sample problems to identify characteristics of solvable and unsolvable cases and created synthetic unsolvable problems featuring missing conditions or logical contradictions. Based on these samples, we developed specialized prompts to improve the models' ability to distinguish solvability.
Each problem undergoes five independent validation attempts, where the LLM:
- Provides step-by-step solutions using LaTeX formatting
- Identifies insolvability due to missing conditions or logical contradictions
- Generates complete reasoning traces for solvable problems
- Presents final answers in standardized LaTeX boxed format (\boxed{})
- Documents any impediments to solution completion
We evaluate two key consistency measures across multiple validation attempts:
- Status Consistency: Agreement on problem solvability
- Answer Consistency:
  - Consistency of solutions across different attempts
  - Agreement between generated solutions and ground truth
The final dataset retains only problems that demonstrate:
- Consistent solvability across validation attempts
- Agreement in solutions across multiple attempts
- Alignment with ground truth answers
This rigorous validation process ensures the resulting dataset comprises well-defined, solvable problems with verified, accurate solutions.
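The retention criteria can be read as a simple predicate over the validation attempts. The sketch below assumes the strictest interpretation, namely unanimous agreement and exact string equality of answers; the post does not state the actual thresholds or answer-matching rules, and the names are hypothetical.

```python
def keep_problem(attempts, ground_truth):
    """Decide whether a problem survives validation.

    attempts: list of (solvable, answer) pairs, one per validation attempt,
    where answer is the boxed final answer as a string (or None if unsolvable).
    """
    if not attempts:
        return False
    # Status consistency: every attempt must judge the problem solvable.
    if not all(solvable for solvable, _ in attempts):
        return False
    # Answer consistency: all attempts agree with each other and with the ground truth.
    answers = {answer for _, answer in attempts}
    return answers == {ground_truth}


if __name__ == "__main__":
    print(keep_problem([(True, "5")] * 5, "5"))
    print(keep_problem([(True, "5")] * 4 + [(True, "6")], "5"))
    print(keep_problem([(True, "5")] * 4 + [(False, None)], "5"))
```

A real pipeline would likely compare answers with a math-aware equivalence check rather than raw strings.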
PRM Data
Stage 1
The dataset statistics of Stage 1 Training are listed below:
Dataset | Generator Model | Num. Inst | Resp/Inst | Step-level/Response-level |
---|---|---|---|---|
UltraInteract | Llama-3.1-8B-Inst | 20177 | 8 | Response-level |
UltraInteract | Llama-3.1-8B-Base | 13570 | 8 | Response-level |
UltraInteract | Qwen2.5-72B-Inst | 4758 | 8 | Response-level |
UltraInteract | Qwen2.5-Math-7B-Base | 25713 | 8 | Response-level |
Numina-SynMath | Llama-3.1-8B-Inst | 4783 | 8 | Response-level |
Numina-SynMath | Qwen2.5-Math-7B-Base | 5806 | 8 | Response-level |
Numina-Olympiads | Llama-3.1-8B-Inst | 2909 | 8 | Response-level |
Numina-Olympiads | Qwen2.5-Math-7B-Base | 4739 | 8 | Response-level |
Stage 2
The dataset statistics of Stage 2 Training are listed below:
Dataset | Generator Model | Num. Inst | Resp/Inst | Step-level/Response-level |
---|---|---|---|---|
MATH | Llama-3.1-70B-Inst | 4715 | 2 | Step-level |
MATH | Qwen2.5-72B-Inst | 6098 | 2 | Step-level |
UltraInteract | Llama-3.1-70B-Inst | 4238 | 2 | Response-level |
Other Results
Results of Different RL Algorithms
The results of different RL algorithms on Llama-3.1-8B are listed below. Since we used a different base model and dataset for the pilot study, the benchmarks used here are slightly different from the main experiments.
Step | Algorithm | Minerva Math | Olympiad Bench | HumanEval | LeetCode | LiveCode Bench | Avg. |
---|---|---|---|---|---|---|---|
256 | PPO | 21.7 | 18.2 | 62.8 | 13.3 | 17.1 | 26.6 |
256 | REINFORCE | 21.7 | 19.0 | 64.6 | 13.9 | 17.1 | 27.3 |
256 | GRPO | 22.8 | 18.4 | 59.2 | 16.1 | 17.3 | 26.8 |
256 | ReMax | 22.8 | 19.6 | 58.5 | 12.8 | 15.8 | 25.9 |
256 | RLOO | 18.8 | 20.7 | 60.4 | 16.1 | 17.8 | 26.8 |
1024 | REINFORCE | 19.5 | 16.0 | 57.3 | 21.1 | 16.0 | 26.0 |
1024 | GRPO | 22.4 | 20.3 | 57.3 | 13.3 | 18.7 | 26.4 |
1024 | ReMax | 24.6 | 17.3 | 61.0 | 21.1 | 18.6 | 28.5 |
1024 | RLOO | 21.0 | 20.6 | 57.9 | 27.8 | 21.4 | 29.7 |
Citation
If you find PRIME or ImplicitPRM helpful, please cite them.
@misc{cui2024process,
title={Process Reinforcement through Implicit Rewards},
author={Ganqu Cui and Lifan Yuan and Zefan Wang and Hanbin Wang and Wendi Li and Bingxiang He and Yuchen Fan and Tianyu Yu and Qixin Xu and Weize Chen and Jiarui Yuan and Huayu Chen and Kaiyan Zhang and Xingtai Lv and Shuo Wang and Yuan Yao and Hao Peng and Yu Cheng and Zhiyuan Liu and Maosong Sun and Bowen Zhou and Ning Ding},
year={2025}
}
@article{yuan2024implicitprm,
title={Free Process Rewards without Process Labels},
author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
journal={arXiv preprint arXiv:2412.01981},
year={2024}
}