Adversarial Markov Decision Process

Adversarial Markov Decision Processes (AMDPs) model sequential decision-making problems in which the environment's reward (or loss) functions, and sometimes its transition dynamics, can change adversarially from episode to episode, challenging the agent's ability to learn good policies. Current research focuses on developing algorithms with improved regret bounds, where regret measures the gap between the agent's cumulative performance and that of the best fixed policy in hindsight, under various settings including bandit feedback, delayed feedback, and function approximation, using methods such as policy optimization and Follow-the-Perturbed-Leader (FTPL). These advances are significant for improving the robustness and efficiency of reinforcement learning algorithms in unpredictable or malicious environments, with applications ranging from robotics and resource allocation to cybersecurity and multi-agent systems.
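As a point of reference, regret in the episodic adversarial setting is typically formalized as follows (a sketch with assumed notation, not drawn from any single paper above): over $K$ episodes with adversarially chosen loss functions, the learner playing policies $\pi_1, \dots, \pi_K$ is compared against the best fixed policy in hindsight,

$$
\mathrm{Reg}_K = \max_{\pi} \sum_{k=1}^{K} \Big( V_k^{\pi_k}(s_1) - V_k^{\pi}(s_1) \Big),
$$

where $V_k^{\pi}(s_1)$ denotes the expected cumulative loss of policy $\pi$ in episode $k$ starting from the initial state $s_1$. Since the losses change every episode, no single policy is optimal throughout, which is why the comparator is the best fixed policy rather than a per-episode optimum.

Follow-the-Perturbed-Leader itself is easiest to illustrate in the simpler experts setting that underlies many AMDP algorithms (in AMDPs the "experts" are typically policies or occupancy measures). Below is a minimal Python sketch, assuming exponential perturbations and a heuristic noise-scale tuning; both choices are illustrative assumptions, not a specific published algorithm:

```python
import numpy as np

def ftpl_select(cumulative_losses, scale, rng):
    """Follow-the-Perturbed-Leader: pick the expert minimizing the
    cumulative loss after subtracting fresh i.i.d. exponential noise."""
    noise = rng.exponential(scale=scale, size=cumulative_losses.shape)
    return int(np.argmin(cumulative_losses - noise))

# Toy run: K rounds, N experts, losses in [0, 1] standing in for an adversary.
rng = np.random.default_rng(0)
K, N = 1000, 5
scale = np.sqrt(K / np.log(N))  # assumed tuning; perturbations grow ~ sqrt(K)
cum_losses = np.zeros(N)
learner_loss = 0.0
for k in range(K):
    a = ftpl_select(cum_losses, scale, rng)  # choose before losses revealed
    losses = rng.uniform(size=N)             # adversary reveals this round's losses
    learner_loss += losses[a]
    cum_losses += losses

# Regret = learner's total loss minus the best expert's loss in hindsight.
print(learner_loss - cum_losses.min())
```

The key design point the sketch shows: the random perturbation keeps the "leader" from being exploitable by an adversary that could otherwise force a deterministic follow-the-leader strategy to switch experts every round.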

Papers