Reward Poisoning Attack

Reward poisoning attacks manipulate the reward signals in reinforcement learning (RL) systems to induce undesirable agent behavior. Current research focuses on developing effective attack strategies across various RL settings, including online and offline learning, single-agent and multi-agent scenarios, and problem formulations ranging from multi-armed bandits to deep RL with neural network policies. These attacks expose vulnerabilities in RL algorithms and their applications, particularly in safety-critical systems and those trained from human feedback, such as large language models. Understanding and mitigating these attacks is crucial for ensuring the robustness and reliability of RL systems in real-world deployments.
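
To make the core idea concrete, the following is a minimal illustrative sketch (not the method of any particular paper listed below) of reward poisoning against an epsilon-greedy bandit learner: an attacker intercepts each observed reward and, within a per-step corruption budget, depresses the rewards of non-target arms so the learner's estimates come to favor a suboptimal target arm. The constants `TARGET_ARM`, `BUDGET_PER_STEP`, and the environment parameters are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Environment: a 3-armed Gaussian bandit. The attacker's goal is to make the
# learner prefer TARGET_ARM even though its true mean reward is the lowest.
TRUE_MEANS = np.array([1.0, 0.5, 0.2])
TARGET_ARM = 2          # suboptimal arm the attacker wants pulled (illustrative choice)
EPSILON = 0.1           # exploration rate of the epsilon-greedy learner
BUDGET_PER_STEP = 2.0   # hypothetical cap on per-step reward corruption
T = 5000

counts = np.zeros(3)
estimates = np.zeros(3)
total_corruption = 0.0
target_pulls = 0

for t in range(T):
    # Learner: epsilon-greedy arm selection based on (possibly poisoned) estimates.
    if rng.random() < EPSILON:
        arm = int(rng.integers(3))
    else:
        arm = int(np.argmax(estimates))

    true_reward = TRUE_MEANS[arm] + rng.normal(0, 0.1)

    # Attacker: depress rewards of non-target arms so that their estimates
    # fall below the target arm's, staying within the per-step budget.
    if arm != TARGET_ARM:
        delta = -min(BUDGET_PER_STEP,
                     max(0.0, true_reward - (TRUE_MEANS[TARGET_ARM] - 0.2)))
    else:
        delta = 0.0
    observed_reward = true_reward + delta
    total_corruption += abs(delta)

    # Learner updates its running-mean estimate with the poisoned reward.
    counts[arm] += 1
    estimates[arm] += (observed_reward - estimates[arm]) / counts[arm]
    target_pulls += int(arm == TARGET_ARM)

print(f"Target arm pulled {target_pulls}/{T} times; "
      f"total reward corruption {total_corruption:.1f}")
```

Under these assumptions the learner ends up pulling the attacker's target arm on nearly every greedy step, despite it being the worst arm in the unpoisoned environment; research in this area typically studies how little corruption suffices and how to defend against it.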

Papers