arxiv:2501.07301

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Published on Jan 13 · Submitted by chujiezheng on Jan 14

#1 Paper of the day

Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin

Abstract

Process Reward Models (PRMs) have emerged as a promising approach to process supervision in the mathematical reasoning of Large Language Models (LLMs), aiming to identify and mitigate intermediate errors in the reasoning process. However, developing effective PRMs faces significant challenges, particularly in data annotation and evaluation methodology. In this paper, through extensive experiments, we demonstrate that the commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, which leads to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objective of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of their minimum scores concentrated on the final answer steps, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives, and we provide practical guidelines for future research on building process supervision models.
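To make the terms in the abstract concrete, here is a minimal Python sketch of MC-estimated step labels, a consensus filter, and BoN selection. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the hard/soft label definitions follow common MC-estimation practice, the consensus rule (keep a sample only when MC estimation and the LLM judge agree on the first erroneous step) is one plausible reading of "integrates MC estimation with LLM-as-a-judge", and scoring a response by its minimum step score is just one possible aggregation.

```python
from typing import List, Optional, Tuple


def mc_step_labels(successes: List[int], rollouts: int) -> Tuple[List[float], List[int]]:
    """Turn rollout outcomes into step labels (assumed definitions).

    successes[i] is the number of completions, sampled from step i,
    that reached the correct final answer out of `rollouts` attempts.
    Soft label: the success rate. Hard label: 1 if any completion succeeded.
    """
    soft = [s / rollouts for s in successes]
    hard = [1 if s > 0 else 0 for s in successes]
    return soft, hard


def first_error_step(hard_labels: List[int]) -> Optional[int]:
    """Index of the first step labeled incorrect, or None if all steps pass."""
    for i, label in enumerate(hard_labels):
        if label == 0:
            return i
    return None


def consensus_keep(mc_hard: List[int], judge_first_error: Optional[int]) -> bool:
    """Hypothetical consensus rule: keep a training sample only if the MC
    labels and the LLM judge locate the first error at the same step."""
    return first_error_step(mc_hard) == judge_first_error


def best_of_n(candidates_step_scores: List[List[float]]) -> int:
    """Pick the candidate whose weakest step scores highest
    (min aggregation is an assumption; product is another common choice)."""
    response_scores = [min(scores) for scores in candidates_step_scores]
    return max(range(len(response_scores)), key=response_scores.__getitem__)


if __name__ == "__main__":
    soft, hard = mc_step_labels([8, 4, 0], rollouts=8)
    print(soft, hard)
    print(consensus_keep(hard, judge_first_error=2))
    print(best_of_n([[0.9, 0.2, 0.8], [0.7, 0.6, 0.9]]))
```

The sketch also shows the failure mode the abstract points at: a step can receive a positive MC label merely because some completion recovers and reaches the right answer, which is why a second, independent signal such as an LLM judge is used to filter the data.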

Community

chujiezheng

Paper author Paper submitter 1 day ago

•

edited 1 day ago

We share our practices and lessons on building process reward models (PRMs) for mathematical reasoning, and release two strong PRMs:

Quadyun

about 6 hours ago

In the second line of "2.3 Evaluation Results", "Qwen2.5-Math-7B-PRM-MC-hard (trained with soft labels)" is not correct. It should actually be "Qwen2.5-Math-7B-PRM-MC-soft (trained with soft labels)", right?