Monotone and Conservative Policy Iteration Beyond the Tabular Case

Abstract

We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI) that retain tabular guarantees under function approximation. RPI performs policy evaluation via a novel Bellman-constrained optimization. We show that RPI restores the textbook monotonicity of value estimates and that these estimates provably lower-bound the true return; moreover, their limit partially satisfies the unprojected Bellman equation. CRPI uses the same evaluation step as RPI but updates policies conservatively, maximizing a new performance-difference lower bound that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI’s guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants.
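
Background (standard facts, not the paper's exact constructions): the lower-bound property of RPI's evaluation is consistent with the classical monotonicity argument for one-sided Bellman constraints. If a value estimate V satisfies the pointwise Bellman inequality for a fixed policy \pi, then, because the Bellman operator T^{\pi} is monotone and contracts to the true value V^{\pi},

\[
  V \le T^{\pi} V
  \;\Longrightarrow\;
  V \le (T^{\pi})^{k} V \ \text{for all } k
  \;\Longrightarrow\;
  V \le \lim_{k \to \infty} (T^{\pi})^{k} V = V^{\pi}.
\]

Likewise, the performance-difference lower bound maximized by CRPI presumably refines the classical performance-difference lemma (Kakade and Langford, 2002), which for discounted return J and policies \pi, \pi' states

\[
  J(\pi') - J(\pi)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\pi'}}\,
    \mathbb{E}_{a \sim \pi'(\cdot \mid s)}\bigl[A^{\pi}(s,a)\bigr],
\]

where d^{\pi'} is the normalized discounted state-visitation distribution of \pi' and A^{\pi} is the advantage function of \pi; per the abstract, CRPI's bound adds explicit function-approximation error terms to this identity.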

Publication
International Conference on Artificial Intelligence and Statistics (AISTATS) 2026
