Paper ID: 2407.18878

Order-Optimal Global Convergence for Average Reward Reinforcement Learning via Actor-Critic Approach

Swetha Ganesh, Washim Uddin Mondal, Vaneet Aggarwal

This work analyzes average-reward reinforcement learning with general parametrization. Current state-of-the-art (SOTA) guarantees for this problem are either suboptimal or demand prior knowledge of the mixing time of the underlying Markov process, which is unavailable in most practical scenarios. We introduce a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm to address these issues. Our approach is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ without needing the knowledge of mixing time. It significantly surpasses the SOTA bound of $\tilde{\mathcal{O}}(T^{-1/4})$ where $T$ is the horizon length.

Submitted: Jul 26, 2024

Topics

Monte Carlo
Multi Agent Actor Critic
Policy Parametrization
Average Reward Reinforcement Learning

Links

arXiv PDF