Progress Advantage for LLM Agents

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

¹University of Wisconsin-Madison ²Argonne National Laboratory

Abstract

Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training altogether. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term the progress advantage: the log-probability ratio between the RL-trained policy and its reference policy, which exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the effectiveness of the progress advantage in three applications (test-time scaling, uncertainty quantification, and failure attribution) on five benchmarks and four model families. Across all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated trained reward models. We complement these results with deeper analyses of the characteristics of the progress advantage, offering practical guidance for adoption in real-world agentic systems.
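To make the log-probability ratio concrete, here is a minimal Python sketch of how a step-level score of this kind could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names, the summation of token log-probabilities over each action, the `beta` scaling factor (borrowed from KL-regularized RL objectives; the abstract does not mention one), and the "lowest-scoring step" heuristic for failure attribution are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

# One agent step: per-token log-probs of the chosen action under the
# RL-trained policy and under the reference policy (same tokens, same order).
Step = Tuple[Sequence[float], Sequence[float]]


def progress_advantage(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 1.0,  # assumed scaling factor; not specified in the abstract
) -> float:
    """Log-probability ratio log pi(a|s) - log pi_ref(a|s) for one action.

    The action's log-probability is taken as the sum of its token
    log-probabilities (an assumption of this sketch).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference must score the same tokens")
    return beta * sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def score_trajectory(steps: Sequence[Step], beta: float = 1.0) -> List[float]:
    """Score every step of a trajectory with the log-probability ratio."""
    return [progress_advantage(p, r, beta) for p, r in steps]


def lowest_scoring_step(scores: Sequence[float]) -> int:
    """Index of the step with the smallest score (hypothetical heuristic
    for flagging a candidate failure step)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return min(range(len(scores)), key=lambda i: scores[i])


# Toy trajectory with made-up log-probabilities (not real model outputs).
trajectory: List[Step] = [
    ([-0.5, -1.0], [-1.0, -2.0]),  # policy prefers this action more than ref
    ([-2.0], [-0.5]),              # policy prefers this action less than ref
]
scores = score_trajectory(trajectory)
flagged = lowest_scoring_step(scores)
```

In practice the two log-probability sequences would come from scoring the same action tokens with the post-trained checkpoint and its reference checkpoint; how the scores are aggregated for test-time scaling or uncertainty quantification is not described in the abstract and is left open here.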

Acknowledgments

We sincerely thank Jiatong Li, Leitian Tao, Sangyun Lee, and Jiaying Fang for their faithful proofreading and professional feedback on the draft that directly affected the writing and experiment content and sparked future work ideas. This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract number DE-AC02-06CH11357. Wendi Li, Seongheon Park, Samuel Yeh and Sharon Li are supported in part by the AFOSR Young Investigator Program under award number FA9550-23-1-0184, National Science Foundation under awards IIS-2237037 and IIS-2331669, Schmidt Sciences Foundation, Open Philanthropy (now Coefficient Giving), Alfred P. Sloan Fellowship, and gifts from Google and Amazon.

BibTeX

@article{oh2026neglected,
  title   = {Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents},
  author  = {Oh, Changdae and Li, Wendi and Park, Seongheon and Yeh, Samuel and Mallick, Tanwi and Li, Sharon},
  journal = {arXiv preprint arXiv:2606.26080},
  year    = {2026}
}


We introduce Progress Advantage, an implicit process reward signal derived as a byproduct of post-training, enabling step-level guidance and monitoring for LLM agents in stochastic environments.
