Preference Poisoning

Preference poisoning attacks exploit vulnerabilities in machine learning systems, particularly those trained using human feedback, by manipulating preference data to induce undesirable behaviors. Current research focuses on understanding the effectiveness of various poisoning strategies across different model architectures, including those based on reinforcement learning and federated learning, and exploring defensive mechanisms such as constrained optimization and robust loss functions. This research is crucial for ensuring the safety and reliability of AI systems, especially those impacting decision-making in high-stakes applications like recommendation systems and search engines, where malicious actors could manipulate user preferences for their own gain.

Papers