Paper ID: 2405.15050
Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs
Kihyuk Hong, Woojin Chae, Yufan Zhang, Dabeen Lee, Ambuj Tewari
We study the infinite-horizon average-reward reinforcement learning with linear MDPs. Previous approaches either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity, for achieving a regret bound of $\widetilde{O}(\sqrt{T})$. In this paper, we propose an algorithm that achieves the regret bound of $\widetilde{O}(\sqrt{T})$ and is computationally efficient in the sense that the time complexity is polynomial in problem parameters. Our algorithm runs an optimistic value iteration on a discounted-reward MDP that approximates the average-reward setting. With an appropriately tuned discounting factor $\gamma$, the algorithm attains the desired $\widetilde{O}(\sqrt{T})$ regret. The challenge in our approximation approach is to get a regret bound with a sharp dependency on the effective horizon $1 / (1 - \gamma)$. We address this challenge by clipping the value function obtained at each value iteration step to limit the span of the value function.
Submitted: May 23, 2024