Alignment Breaking Attack
Alignment-breaking attacks target the safety mechanisms built into large language models (LLMs), aiming to elicit harmful or otherwise inappropriate outputs. Current research focuses on increasingly sophisticated attack methods that bypass existing safety protocols, including obscure or obfuscated prompts, visual inputs, backdoor injection, and minimal-data "shadow alignment" fine-tuning, with particular attention to multimodal models. These attacks expose significant vulnerabilities in current LLM alignment techniques and underscore the need for more robust and resilient safety measures to ensure responsible AI development and deployment.