Jailbreak Attack
Jailbreak attacks exploit vulnerabilities in large language models (LLMs) and other AI systems, aiming to bypass safety mechanisms and elicit harmful or unintended outputs. Current research focuses on developing novel attack methods, such as those leveraging resource exhaustion, implicit references, or continuous optimization via image inputs, and evaluating their effectiveness against various model architectures (including LLMs, vision-language models, and multimodal models). Understanding and mitigating these attacks is crucial for ensuring the safe and responsible deployment of AI systems, impacting both the trustworthiness of AI and the development of robust defense strategies.
Papers
Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles
Zhilong Wang, Haizhou Wang, Nanqing Luo, Lan Zhang, Xiaoyan Sun, Yebo Cao, Peng Liu
Perception-guided Jailbreak against Text-to-Image Models
Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation
Haoyu Wang, Bingzhe Wu, Yatao Bian, Yongzhe Chang, Xueqian Wang, Peilin Zhao
Can Large Language Models Automatically Jailbreak GPT-4V?
Yuanwei Wu, Yue Huang, Yixin Liu, Xiang Li, Pan Zhou, Lichao Sun
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren
Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models
Shi Lin, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts
Yi Liu, Chengjun Cai, Xiaoli Zhang, Xingliang Yuan, Cong Wang