Reward Poisoning Attack
Reward poisoning attacks manipulate the reward signals in reinforcement learning (RL) systems to steer agents toward undesirable, attacker-chosen behavior. Current research focuses on developing effective attack strategies across a range of RL settings, including online and offline learning, single-agent and multi-agent scenarios, and different problem formulations, from multi-armed bandits to deep RL with neural network policies. These attacks expose vulnerabilities in RL algorithms and their applications, particularly in safety-critical systems and in pipelines that learn from human feedback, such as RLHF for large language models. Understanding and mitigating these attacks is crucial for ensuring the robustness and reliability of RL systems in real-world deployments.
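To make the core idea concrete, the following is a minimal sketch (not drawn from any of the listed papers) of reward poisoning against a tabular value-learning agent: the attacker intercepts each reward and adds a bounded perturbation so that a suboptimal target action ends up with the highest learned value. The environment, the per-step budget `epsilon`, and the choice of target action are illustrative assumptions.

```python
# Illustrative reward-poisoning sketch: a single-state, two-action problem
# where action 0 is truly better, but bounded reward perturbations make the
# agent learn to prefer the attacker's target action instead.
import numpy as np

rng = np.random.default_rng(0)

n_actions = 2
true_means = np.array([1.0, 0.0])   # action 0 yields higher true reward
target_action = 1                   # attacker's desired (suboptimal) action
epsilon = 1.5                       # assumed per-step poisoning budget

alpha = 0.1                         # learning rate
q = np.zeros(n_actions)             # value estimates for the single state

def poison(reward, action):
    """Shift the observed reward by at most epsilon to favor the target action."""
    delta = epsilon if action == target_action else -epsilon
    return reward + delta

for t in range(5000):
    # epsilon-greedy behavior policy
    if rng.random() < 0.1:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(q))

    clean_reward = true_means[action] + rng.normal(scale=0.1)
    observed_reward = poison(clean_reward, action)   # attacker corrupts the signal

    # standard incremental value update on the poisoned reward
    q[action] += alpha * (observed_reward - q[action])

print("Learned values:", q)
print("Preferred action:", int(np.argmax(q)))  # converges to the target action
```

Because the perturbation is bounded per step, this corresponds to the budget-constrained threat model commonly studied in this literature; larger budgets make the attack easier, while defenses typically aim to bound the damage any such perturbation can cause.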