Episodic Markov Decision Process
Episodic Markov Decision Processes (EMDPs) model sequential decision-making problems in which the agent interacts with the environment in episodes that end after a fixed number of steps, and the goal is to learn a policy that maximizes cumulative reward. Current research emphasizes provably efficient algorithms, particularly model-free methods and settings with function approximation, and often employs techniques such as upper confidence bounds, posterior sampling, and reference-advantage decomposition to handle stochasticity and improve sample efficiency. These advances matter both for the theoretical understanding of reinforcement learning and for practical applications, enabling faster and more robust learning in complex environments with limited data.
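As a concrete illustration of the optimistic, model-free style of algorithm described above, the sketch below runs tabular Q-learning with a Hoeffding-style upper-confidence bonus on a small random episodic MDP. It is a minimal sketch, not the method of either paper listed below: the toy MDP, the fixed initial state, and the constants `c` and `delta` are illustrative assumptions.

```python
"""Minimal sketch: Q-learning with a UCB (Hoeffding) exploration bonus on a
random tabular episodic MDP. The toy MDP and all constants are assumptions
chosen for illustration, not taken from the listed papers."""
import numpy as np

rng = np.random.default_rng(0)

S, A, H, K = 5, 3, 4, 2000   # states, actions, horizon, number of episodes
c, delta = 0.5, 0.1          # bonus scale and failure probability (assumed)

# Random stage-independent MDP: transition probabilities and rewards.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

# Optimistic initialization: Q = H upper-bounds the true value function.
Q = np.full((H, S, A), float(H))
N = np.zeros((H, S, A))      # per-stage visit counts

total_reward = 0.0
for k in range(K):
    s = 0                    # fixed initial state each episode (assumed)
    for h in range(H):
        a = int(np.argmax(Q[h, s]))             # act greedily w.r.t. optimistic Q
        s_next = int(rng.choice(S, p=P[s, a]))
        r = R[s, a]
        total_reward += r

        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)               # stage-dependent step size
        bonus = c * np.sqrt(H**3 * np.log(S * A * H * K / delta) / t)
        V_next = 0.0 if h == H - 1 else min(H, Q[h + 1, s_next].max())
        # Optimistic Bellman backup with exploration bonus.
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V_next + bonus)
        s = s_next

print(f"average per-episode return over {K} episodes: {total_reward / K:.3f}")
```

The bonus shrinks as a state-action pair is visited more often, so under-explored pairs keep an optimistic value and are eventually tried; this is the mechanism that lets such methods trade off exploration and exploitation with provable sample-efficiency guarantees.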
Papers
Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach
Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun
Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg