Adversarial Markov Decision Process

Adversarial Markov Decision Processes (AMDPs) model sequential decision-making problems in which the environment's reward (or loss) functions, and sometimes its transition dynamics, can change adversarially from episode to episode, challenging the agent's ability to learn good policies. Current research focuses on developing algorithms with improved regret bounds, where regret measures the gap between the agent's cumulative performance and that of the best fixed policy in hindsight, under various settings including bandit feedback, delayed feedback, and function approximation, using methods such as policy optimization and Follow-the-Perturbed-Leader (FTPL). These advances are significant for improving the robustness and efficiency of reinforcement learning algorithms in unpredictable or malicious environments, with applications ranging from robotics and resource allocation to cybersecurity and multi-agent systems.
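As a point of reference, regret in the episodic adversarial setting is typically formalized as follows (a sketch with assumed notation, not drawn from any single paper above): over $K$ episodes with adversarially chosen loss functions, the learner playing policies $\pi_1, \dots, \pi_K$ is compared against the best fixed policy in hindsight,

$$
\mathrm{Reg}_K = \max_{\pi} \sum_{k=1}^{K} \Big( V_k^{\pi_k}(s_1) - V_k^{\pi}(s_1) \Big),
$$

where $V_k^{\pi}(s_1)$ denotes the expected cumulative loss of policy $\pi$ in episode $k$ starting from the initial state $s_1$. Since the losses change every episode, no single policy is optimal throughout, which is why the comparator is the best fixed policy rather than a per-episode optimum.

Follow-the-Perturbed-Leader itself is easiest to illustrate in the simpler experts setting that underlies many AMDP algorithms (in AMDPs the "experts" are typically policies or occupancy measures). Below is a minimal Python sketch, assuming exponential perturbations and a heuristic noise-scale tuning; both choices are illustrative assumptions, not a specific published algorithm:

```python
import numpy as np

def ftpl_select(cumulative_losses, scale, rng):
    """Follow-the-Perturbed-Leader: pick the expert minimizing the
    cumulative loss after subtracting fresh i.i.d. exponential noise."""
    noise = rng.exponential(scale=scale, size=cumulative_losses.shape)
    return int(np.argmin(cumulative_losses - noise))

# Toy run: K rounds, N experts, losses in [0, 1] standing in for an adversary.
rng = np.random.default_rng(0)
K, N = 1000, 5
scale = np.sqrt(K / np.log(N))  # assumed tuning; perturbations grow ~ sqrt(K)
cum_losses = np.zeros(N)
learner_loss = 0.0
for k in range(K):
    a = ftpl_select(cum_losses, scale, rng)  # choose before losses revealed
    losses = rng.uniform(size=N)             # adversary reveals this round's losses
    learner_loss += losses[a]
    cum_losses += losses

# Regret = learner's total loss minus the best expert's loss in hindsight.
print(learner_loss - cum_losses.min())
```

The key design point the sketch shows: the random perturbation keeps the "leader" from being exploitable by an adversary that could otherwise force a deterministic follow-the-leader strategy to switch experts every round.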

Papers