Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. However, PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in a solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs.
Our approach capitalizes on the inherent reasoning abilities of long CoT models and outperforms both LLM-as-a-judge and discriminative verifiers -- while using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats these baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search.
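To make the verification setup concrete, here is a minimal sketch, in Python, of verbalized step-wise verification: the verifier is prompted with the problem and the candidate solution, generates a long verification CoT that critiques each step, and per-step verdicts are parsed out of that critique. The prompt template, the \boxed{correct}/\boxed{incorrect} verdict format, and the generic `generate` callable are illustrative assumptions, not necessarily the exact interface ThinkPRM was fine-tuned on.

```python
import re
from typing import Callable, List

# Hypothetical prompt template; the exact format used to train ThinkPRM may differ.
VERIFY_TEMPLATE = (
    "You are given a math problem and a proposed step-by-step solution:\n\n"
    "[Problem]\n{problem}\n\n"
    "[Solution]\n{solution}\n\n"
    "Review and critique each step, then label every step as "
    "\\boxed{{correct}} or \\boxed{{incorrect}}."
)

def verify_solution(problem: str,
                    steps: List[str],
                    generate: Callable[[str], str]) -> List[bool]:
    """Verbalized step-wise verification: generate a verification chain-of-thought
    and parse a correct/incorrect judgment for each solution step."""
    solution = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    verification_cot = generate(VERIFY_TEMPLATE.format(problem=problem, solution=solution))
    # The i-th \boxed{...} verdict in the critique is taken as the label for step i.
    verdicts = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
    return [v == "correct" for v in verdicts]
```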
Left: Verifier F1-score on ProcessBench (Zheng et al., 2024). ThinkPRM-14B, trained on about 1K synthetic verification chains (roughly 8K process labels), outperforms discriminative PRMs trained on about 100x more data. Right: Accuracy on MATH-500 using Llama-3.2-3B-Instruct as generator with different verifiers. ThinkPRM-1.5B, trained on the same 8K labels, outperforms LLM-as-a-judge and discriminative verifiers in reward-guided search on MATH-500. The LLM-as-a-judge in both figures uses the same base model as ThinkPRM.
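Reward-guided search uses the verifier's step-level scores during generation rather than only at the end. A minimal sketch of one common realization -- step-level beam search -- is below; `propose_steps` and `score_prefix` are hypothetical stand-ins for the generator and the verifier's prefix scoring, and the paper's exact search procedure and hyperparameters may differ.

```python
from typing import Callable, List, Tuple

def reward_guided_search(problem: str,
                         propose_steps: Callable[[str, List[str]], List[str]],
                         score_prefix: Callable[[str, List[str]], float],
                         beam_width: int = 4,
                         max_depth: int = 16) -> List[str]:
    """Step-level beam search: expand each kept partial solution with candidate
    next steps from the generator, then keep the prefixes the verifier scores highest."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_depth):
        expansions: List[Tuple[float, List[str]]] = []
        for _, prefix in beams:
            for step in propose_steps(problem, prefix):
                candidate = prefix + [step]
                expansions.append((score_prefix(problem, candidate), candidate))
        if not expansions:  # generator produced no continuations (all beams finished)
            break
        expansions.sort(key=lambda sc: sc[0], reverse=True)
        beams = expansions[:beam_width]
    return max(beams, key=lambda sc: sc[0])[1]
```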
ThinkPRM supports scaling verification compute by thinking longer. The graph shows F1-scores on ProcessBench as we increase the thinking budget (number of tokens). ThinkPRM consistently outperforms both LLM-as-a-judge and discriminative PRM (DiscPRM) baselines, with performance improving as we allow more computation. While LLM-as-a-judge shows inconsistent performance with increased compute, ThinkPRM demonstrates stable improvements, reaching peak performance around 24K tokens.
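One simple way to realize such a token budget is to cap the length of the verification CoT and, if the verifier runs out of budget before committing to a verdict, force a short final decision. The sketch below assumes a hypothetical `generate(prompt, max_new_tokens)` interface and is not necessarily the exact budget-control mechanism used in the paper.

```python
from typing import Callable

def verify_with_budget(verifier_prompt: str,
                       generate: Callable[[str, int], str],
                       thinking_budget: int,
                       verdict_budget: int = 64) -> str:
    """Cap the verification CoT at `thinking_budget` tokens; if it was cut off
    before reaching a verdict, force a short final decision."""
    cot = generate(verifier_prompt, thinking_budget)
    if "\\boxed{" not in cot:
        # Thinking was truncated: ask for the verdict given the partial critique.
        cot += "\nFinal verdict: " + generate(
            verifier_prompt + cot + "\nFinal verdict: ", verdict_budget)
    return cot
```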
Best-of-N performance on AIME '24 and MATH-500. Compared to LLM-as-a-judge, DiscPRM, and (unweighted) majority vote, ThinkPRM-14B exhibits the best accuracy scaling curve. On AIME '24 (left), using Qwen2.5-32B-Instruct as generator, ThinkPRM-14B consistently outperforms the baselines across different numbers of sampled solutions. On MATH-500 (right), with Qwen2.5-14B as generator, ThinkPRM-14B shows the best scaling behavior, achieving the highest accuracy as the number of candidate solutions increases and demonstrating its effectiveness at selecting correct solutions from a larger pool.
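Best-of-N selection itself is simple: sample N candidate solutions and keep the one the verifier scores highest. The sketch below uses hypothetical `generate_solution` and `score_solution` callables; with a verbalized PRM, the score could be, for example, the fraction or the minimum of the per-step correctness judgments extracted from the verification CoT.

```python
from typing import Callable

def best_of_n(problem: str,
              generate_solution: Callable[[str], str],
              score_solution: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions and return the one the verifier scores highest."""
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda c: score_solution(problem, c))
```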
Best-of-N performance on two out-of-domain tasks: science QA (GPQA-Physics) and code generation (LiveCodeBench). Although ThinkPRM was fine-tuned only on math, it exhibits superior out-of-domain performance compared to the baselines, especially at larger sampling budgets. On GPQA-Physics (left), using Qwen2.5-32B-Instruct as generator, ThinkPRM-14B improves consistently as the number of solutions increases. On LiveCodeBench (right), with Qwen2.5-Coder-7B as generator, ThinkPRM-14B significantly outperforms all baselines, including DiscPRM-14B, which struggles despite being trained on an order of magnitude more process labels. This demonstrates ThinkPRM's strong generalization beyond its training domain.
ThinkPRM introduces a novel approach to process verification by leveraging the inherent reasoning capabilities of large language models. Our key innovations include verbalized step-wise verification through a long verification chain-of-thought, data efficiency (about 1K synthetic verification chains, roughly 1% of the process labels in PRM800K), and the ability to scale verification compute at test time by thinking longer.
ThinkPRM represents a significant advance in process verification and reward modeling: with a small fraction of the process labels required by discriminative PRMs, it achieves stronger verification on ProcessBench, better best-of-N selection and reward-guided search on math benchmarks, and robust generalization to out-of-domain tasks such as science QA and code generation.
@article{khalifa2025thinkprm,
title={Process Reward Models That Think},
author={Khalifa, Muhammad and Agarwal, Rishabh and Logeswaran, Lajanugen and Kim, Jaekyeom and Peng, Hao and Lee, Moontae and Lee, Honglak and Wang, Lu},
journal={arXiv preprint arXiv:2504.16828},
year={2025}
}