Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. However, PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in a solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs.
Our approach capitalizes on the inherent reasoning abilities of long CoT models and outperforms both LLM-as-a-judge and discriminative verifiers -- while using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats these baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search.
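To make the verification setup concrete, here is a minimal sketch, in Python, of verbalized step-wise verification: the verifier is prompted with the problem and the candidate solution, generates a long verification CoT that critiques each step, and per-step verdicts are parsed out of that critique. The prompt template, the \boxed{correct}/\boxed{incorrect} verdict format, and the generic `generate` callable are illustrative assumptions, not necessarily the exact interface ThinkPRM was fine-tuned on.

```python
import re
from typing import Callable, List

# Hypothetical prompt template; the exact format used to train ThinkPRM may differ.
VERIFY_TEMPLATE = (
    "You are given a math problem and a proposed step-by-step solution:\n\n"
    "[Problem]\n{problem}\n\n"
    "[Solution]\n{solution}\n\n"
    "Review and critique each step, then label every step as "
    "\\boxed{{correct}} or \\boxed{{incorrect}}."
)

def verify_solution(problem: str,
                    steps: List[str],
                    generate: Callable[[str], str]) -> List[bool]:
    """Verbalized step-wise verification: generate a verification chain-of-thought
    and parse a correct/incorrect judgment for each solution step."""
    solution = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    verification_cot = generate(VERIFY_TEMPLATE.format(problem=problem, solution=solution))
    # The i-th \boxed{...} verdict in the critique is taken as the label for step i.
    verdicts = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
    return [v == "correct" for v in verdicts]
```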
Left: Verifier F1-score on ProcessBench (Zheng et al., 2024). ThinkPRM-14B, trained on about 1K synthetic verification chains (roughly 8K process labels), outperforms discriminative PRMs trained on about 100x more data. Right: Accuracy on MATH-500 using Llama-3.2-3B-Instruct as generator with different verifiers. ThinkPRM-1.5B, trained on the same 8K labels, outperforms LLM-as-a-judge and discriminative verifiers in reward-guided search on MATH-500. The LLM-as-a-judge in both figures uses the same base model as ThinkPRM.
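Reward-guided search uses the verifier's step-level scores during generation rather than only at the end. A minimal sketch of one common realization -- step-level beam search -- is below; `propose_steps` and `score_prefix` are hypothetical stand-ins for the generator and the verifier's prefix scoring, and the paper's exact search procedure and hyperparameters may differ.

```python
from typing import Callable, List, Tuple

def reward_guided_search(problem: str,
                         propose_steps: Callable[[str, List[str]], List[str]],
                         score_prefix: Callable[[str, List[str]], float],
                         beam_width: int = 4,
                         max_depth: int = 16) -> List[str]:
    """Step-level beam search: expand each kept partial solution with candidate
    next steps from the generator, then keep the prefixes the verifier scores highest."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_depth):
        expansions: List[Tuple[float, List[str]]] = []
        for _, prefix in beams:
            for step in propose_steps(problem, prefix):
                candidate = prefix + [step]
                expansions.append((score_prefix(problem, candidate), candidate))
        if not expansions:  # generator produced no continuations (all beams finished)
            break
        expansions.sort(key=lambda sc: sc[0], reverse=True)
        beams = expansions[:beam_width]
    return max(beams, key=lambda sc: sc[0])[1]
```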
ThinkPRM supports scaling verification compute by thinking longer. The graph shows F1-scores on ProcessBench as we increase the thinking budget (number of tokens). ThinkPRM consistently outperforms both LLM-as-a-judge and discriminative PRM (DiscPRM) baselines, with performance improving as we allow more computation. While LLM-as-a-judge shows inconsistent performance with increased compute, ThinkPRM demonstrates stable improvements, reaching peak performance around 24K tokens.
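One simple way to realize such a token budget is to cap the length of the verification CoT and, if the verifier runs out of budget before committing to a verdict, force a short final decision. The sketch below assumes a hypothetical `generate(prompt, max_new_tokens)` interface and is not necessarily the exact budget-control mechanism used in the paper.

```python
from typing import Callable

def verify_with_budget(verifier_prompt: str,
                       generate: Callable[[str, int], str],
                       thinking_budget: int,
                       verdict_budget: int = 64) -> str:
    """Cap the verification CoT at `thinking_budget` tokens; if it was cut off
    before reaching a verdict, force a short final decision."""
    cot = generate(verifier_prompt, thinking_budget)
    if "\\boxed{" not in cot:
        # Thinking was truncated: ask for the verdict given the partial critique.
        cot += "\nFinal verdict: " + generate(
            verifier_prompt + cot + "\nFinal verdict: ", verdict_budget)
    return cot
```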
Best-of-N performance on AIME '24 and MATH-500. Compared to LLM-as-a-judge, DiscPRM, and (unweighted) majority vote, ThinkPRM-14B exhibits the best accuracy scaling curve. On AIME '24 (left), using Qwen2.5-32B-Instruct as generator, ThinkPRM-14B consistently outperforms the baselines across different numbers of sampled solutions. On MATH-500 (right), with Qwen2.5-14B as generator, ThinkPRM-14B shows the best scaling behavior, achieving the highest accuracy as the number of candidate solutions increases and demonstrating its effectiveness at selecting correct solutions from a larger pool.
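Best-of-N selection itself is simple: sample N candidate solutions and keep the one the verifier scores highest. The sketch below uses hypothetical `generate_solution` and `score_solution` callables; with a verbalized PRM, the score could be, for example, the fraction or the minimum of the per-step correctness judgments extracted from the verification CoT.

```python
from typing import Callable

def best_of_n(problem: str,
              generate_solution: Callable[[str], str],
              score_solution: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions and return the one the verifier scores highest."""
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda c: score_solution(problem, c))
```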
Best-of-N performance on two out-of-domain tasks: science QA (GPQA-Physics) and code generation (LiveCodeBench). Although ThinkPRM was fine-tuned only on math, it exhibits superior out-of-domain performance compared to the baselines, especially at larger sampling budgets. On GPQA-Physics (left), using Qwen2.5-32B-Instruct as generator, ThinkPRM-14B improves consistently as the number of solutions increases. On LiveCodeBench (right), with Qwen2.5-Coder-7B as generator, ThinkPRM-14B significantly outperforms all baselines, including DiscPRM-14B, which struggles despite being trained on an order of magnitude more process labels. This demonstrates ThinkPRM's strong generalization beyond its training domain.
ThinkPRM introduces a novel approach to process verification by leveraging the inherent reasoning capabilities of large language models. Our key innovations include verbalized step-wise verification through a long verification chain-of-thought, data efficiency (about 1K synthetic verification chains, roughly 1% of the process labels in PRM800K), and the ability to scale verification compute at test time by thinking longer.
ThinkPRM represents a significant advance in process verification and reward modeling: with a small fraction of the process labels required by discriminative PRMs, it achieves stronger verification on ProcessBench, better best-of-N selection and reward-guided search on math benchmarks, and robust generalization to out-of-domain tasks such as science QA and code generation.
@article{khalifa2025thinkprm,
title={Process Reward Models That Think},
author={Khalifa, Muhammad and Agarwal, Rishabh and Logeswaran, Lajanugen and Kim, Jaekyeom and Peng, Hao and Lee, Moontae and Lee, Honglak and Wang, Lu},
journal={arXiv preprint arXiv:2504.16828},
year={2025}
}