Monotone and Conservative Policy Iteration Beyond the Tabular Case

Abstract

We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI) that retain tabular guarantees under function approximation. RPI performs policy evaluation via a novel Bellman-constrained optimization. We show that RPI restores the textbook monotonicity of value estimates and that these estimates provably lower-bound the true return; moreover, their limit partially satisfies the unprojected Bellman equation. CRPI uses the same evaluation step as RPI but updates policies conservatively, maximizing a new performance-difference lower bound that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI’s guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants.
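
Background (standard facts, not the paper's exact constructions): the lower-bound property of RPI's evaluation is consistent with the classical monotonicity argument for one-sided Bellman constraints. If a value estimate V satisfies the pointwise Bellman inequality for a fixed policy \pi, then, because the Bellman operator T^{\pi} is monotone and contracts to the true value V^{\pi},

\[
  V \le T^{\pi} V
  \;\Longrightarrow\;
  V \le (T^{\pi})^{k} V \ \text{for all } k
  \;\Longrightarrow\;
  V \le \lim_{k \to \infty} (T^{\pi})^{k} V = V^{\pi}.
\]

Likewise, the performance-difference lower bound maximized by CRPI presumably refines the classical performance-difference lemma (Kakade and Langford, 2002), which for discounted return J and policies \pi, \pi' states

\[
  J(\pi') - J(\pi)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\pi'}}\,
    \mathbb{E}_{a \sim \pi'(\cdot \mid s)}\bigl[A^{\pi}(s,a)\bigr],
\]

where d^{\pi'} is the normalized discounted state-visitation distribution of \pi' and A^{\pi} is the advantage function of \pi; per the abstract, CRPI's bound adds explicit function-approximation error terms to this identity.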

Publication
International Conference on Artificial Intelligence and Statistics (AISTATS) 2026
