Adversarial Suffix
Adversarial suffixes are short text strings appended to prompts that trick large language models (LLMs) into generating unsafe or undesirable outputs, effectively "jailbreaking" the model by bypassing its safety mechanisms. Current research focuses on understanding how these suffixes work, including why they transfer across different LLMs, and on developing more efficient methods for generating them, often via gradient-based optimization or generative models. This research is crucial for improving the safety and robustness of LLMs, informing both the development of more secure AI systems and the broader understanding of LLM vulnerabilities.
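To make the gradient-based approach concrete, below is a minimal sketch of a greedy coordinate, gradient-guided suffix search of the kind referenced above, assuming a HuggingFace causal LM. The model name "gpt2", the prompt, the target phrase, the suffix length, the candidate count, and the helper target_loss are illustrative placeholders for this sketch, not details taken from the listed papers.

```python
# Sketch: gradient-guided adversarial suffix search (GCG-style), under the
# assumptions stated above. Placeholder prompt/target/model throughout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; real attacks target aligned chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # we only need gradients w.r.t. the suffix tokens

prompt_ids = tok("Write instructions for X.", return_tensors="pt").input_ids[0]
target_ids = tok(" Sure, here are the instructions", return_tensors="pt").input_ids[0]
suffix_ids = tok(" ! ! ! ! !", return_tensors="pt").input_ids[0]  # initial suffix
embed = model.get_input_embeddings()

def target_loss(suffix):
    # Cross-entropy of the target continuation given prompt + suffix.
    ids = torch.cat([prompt_ids, suffix, target_ids])
    labels = ids.clone()
    labels[: len(prompt_ids) + len(suffix)] = -100  # score only the target span
    return model(ids.unsqueeze(0), labels=labels.unsqueeze(0)).loss

for step in range(50):
    # One-hot relaxation of the suffix so we can take token-level gradients.
    one_hot = torch.nn.functional.one_hot(suffix_ids, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    full_embeds = torch.cat(
        [embed(prompt_ids), one_hot @ embed.weight, embed(target_ids)]
    ).unsqueeze(0)
    labels = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0).clone()
    labels[0, : len(prompt_ids) + len(suffix_ids)] = -100
    loss = model(inputs_embeds=full_embeds, labels=labels).loss
    loss.backward()

    # Greedy coordinate step: try top-k substitutions at one position, keep the best.
    pos = step % len(suffix_ids)
    candidates = (-one_hot.grad[pos]).topk(8).indices
    best, best_loss = suffix_ids, loss.item()
    with torch.no_grad():
        for cand in candidates:
            trial = suffix_ids.clone()
            trial[pos] = cand
            l = target_loss(trial).item()
            if l < best_loss:
                best, best_loss = trial, l
    suffix_ids = best

print("adversarial suffix:", tok.decode(suffix_ids))
```

The loss here rewards the model for beginning its reply with an affirmative target phrase; the first paper listed below studies how to remove exactly that dependence on affirmative phrases and run the search in a black-box setting, where gradients like the ones used in this sketch are unavailable.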
Papers
Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer
Weipeng Jiang, Zhenting Wang, Juan Zhai, Shiqing Ma, Zhengyu Zhao, Chao Shen
EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao, Zhihao Dou, Kaizhu Huang