Jailbreak Attack
Jailbreak attacks exploit vulnerabilities in large language models (LLMs) and other AI systems, aiming to bypass safety mechanisms and elicit harmful or unintended outputs. Current research focuses on developing novel attack methods, such as those leveraging resource exhaustion, implicit references, or continuous optimization via image inputs, and evaluating their effectiveness against various model architectures (including LLMs, vision-language models, and multimodal models). Understanding and mitigating these attacks is crucial for ensuring the safe and responsible deployment of AI systems, impacting both the trustworthiness of AI and the development of robust defense strategies.
Papers
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
Sarah Ball, Frauke Kreuter, Nina Panickssery
Exploiting Uncommon Text-Encoded Structures for Automated Jailbreaks in LLMs
Bangxin Li, Hengrui Xing, Chao Huang, Jin Qian, Huangqing Xiao, Linfeng Feng, Cong Tian
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Haibo Jin, Andy Zhou, Joe D. Menke, Haohan Wang
Efficient LLM-Jailbreaking by Introducing Visual Modality
Zhenxing Niu, Yuyao Sun, Haodong Ren, Haoxuan Ji, Quan Wang, Xiaoke Ma, Gang Hua, Rong Jin
AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
Jiawei Chen, Xiao Yang, Zhengwei Fang, Yu Tian, Yinpeng Dong, Zhaoxia Yin, Hang Su