EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
arXiv preprint, 2026
We introduce EBPO, a method that applies empirical Bayes shrinkage to stabilize Group-Relative Policy Optimization (GRPO) for reinforcement learning from human feedback. EBPO uses a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford’s online algorithm. It offers theoretical guarantees of reduced training instability and yields empirical improvements on mathematical reasoning benchmarks, including AIME and OlympiadBench.
Recommended citation: Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, and Lizhu Zhang. (2026). EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization. arXiv preprint arXiv:2602.05165.
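The shrinkage idea summarized above can be illustrated with a minimal sketch: a global reward prior is maintained with Welford's online algorithm, and each group's baseline is a blend of the local group mean and the global mean. The class and function names, and the pseudo-count parameterization of the shrinkage weight, are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of empirical-Bayes shrinkage for group-relative
# advantages. Names and the pseudo-count weighting are assumptions made
# for illustration, not EBPO's published formulation.
from dataclasses import dataclass


@dataclass
class WelfordPrior:
    """Global running mean/variance via Welford's online algorithm."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count > 1 else 0.0


def shrunk_advantages(rewards: list[float], prior: WelfordPrior,
                      strength: float = 4.0) -> list[float]:
    """Compute advantages against a baseline shrunk toward the global prior.

    `strength` acts as a pseudo-count: larger values pull the baseline
    harder toward the global mean (an assumed parameterization).
    """
    n = len(rewards)
    group_mean = sum(rewards) / n
    w = n / (n + strength)  # weight on the local group estimate
    baseline = w * group_mean + (1.0 - w) * prior.mean
    for r in rewards:          # fold this group into the global prior
        prior.update(r)
    return [r - baseline for r in rewards]
```

With a small group and a prior mean near the group mean, the shrunk baseline stays close to GRPO's local baseline; when the group is small or its rewards are degenerate, the global prior dominates, which is the stabilization effect the abstract describes.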
