EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
arXiv preprint, 2026
We introduce EBPO, a method that applies empirical Bayes shrinkage to stabilize Group-Relative Policy Optimization (GRPO) for reinforcement learning from human feedback. EBPO uses a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford’s online algorithm. It offers theoretical guarantees of reduced training instability and yields empirical improvements on mathematical reasoning benchmarks, including AIME and OlympiadBench.
Recommended citation: Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, and Lizhu Zhang. (2026). EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization. arXiv preprint arXiv:2602.05165.
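The shrinkage idea summarized above can be illustrated with a minimal sketch: a global reward prior is maintained with Welford's online algorithm, and each group's baseline is a blend of the local group mean and the global mean. The class and function names, and the pseudo-count parameterization of the shrinkage weight, are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of empirical-Bayes shrinkage for group-relative
# advantages. Names and the pseudo-count weighting are assumptions made
# for illustration, not EBPO's published formulation.
from dataclasses import dataclass


@dataclass
class WelfordPrior:
    """Global running mean/variance via Welford's online algorithm."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count > 1 else 0.0


def shrunk_advantages(rewards: list[float], prior: WelfordPrior,
                      strength: float = 4.0) -> list[float]:
    """Compute advantages against a baseline shrunk toward the global prior.

    `strength` acts as a pseudo-count: larger values pull the baseline
    harder toward the global mean (an assumed parameterization).
    """
    n = len(rewards)
    group_mean = sum(rewards) / n
    w = n / (n + strength)  # weight on the local group estimate
    baseline = w * group_mean + (1.0 - w) * prior.mean
    for r in rewards:          # fold this group into the global prior
        prior.update(r)
    return [r - baseline for r in rewards]
```

With a small group and a prior mean near the group mean, the shrunk baseline stays close to GRPO's local baseline; when the group is small or its rewards are degenerate, the global prior dominates, which is the stabilization effect the abstract describes.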
