Instruction Backdoor Attack

Instruction backdoor attacks exploit the vulnerabilities of instruction-tuned large language models (LLMs) by embedding malicious triggers within instructions, causing the model to generate harmful outputs upon encountering these triggers. Current research focuses on developing both sophisticated attack methods, including multimodal attacks targeting visual-language models and compositional attacks that hide malicious intent within seemingly harmless instructions, and robust defense mechanisms, such as embedding-based adversarial removal and test-time mitigation strategies using defensive demonstrations. Understanding and mitigating these attacks is crucial for ensuring the safe and reliable deployment of LLMs in various applications, impacting both the security of AI systems and the trustworthiness of AI-driven services.

Papers