Jailbreak Attack
Jailbreak attacks exploit vulnerabilities in large language models (LLMs) and other AI systems, aiming to bypass safety mechanisms and elicit harmful or unintended outputs. Current research focuses on developing novel attack methods, such as those leveraging resource exhaustion, implicit references, or continuous optimization via image inputs, and evaluating their effectiveness against various model architectures (including LLMs, vision-language models, and multimodal models). Understanding and mitigating these attacks is crucial for ensuring the safe and responsible deployment of AI systems, impacting both the trustworthiness of AI and the development of robust defense strategies.
Papers
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
Honglin Mu, Han He, Yuxin Zhou, Yunlong Feng, Yang Xu, Libo Qin, Xiaoming Shi, Zeming Liu, Xudong Han, Qi Shi, Qingfu Zhu, Wanxiang Che
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang
A Realistic Threat Model for Large Language Model Jailbreaks
Valentyn Boreiko, Alexander Panfilov, Vaclav Voracek, Matthias Hein, Jonas Geiping
A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns
Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
Boosting Jailbreak Transferability for Large Language Models
Hanqing Liu, Lifeng Zhou, Huanqian Yan
SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis
Aidan Wong, He Cao, Zijing Liu, Yu Li