Jailbreak Attack
Jailbreak attacks exploit vulnerabilities in large language models (LLMs) and other AI systems, aiming to bypass safety mechanisms and elicit harmful or unintended outputs. Current research focuses on developing novel attack methods, such as those leveraging resource exhaustion, implicit references, or continuous optimization via image inputs, and evaluating their effectiveness against various model architectures (including LLMs, vision-language models, and multimodal models). Understanding and mitigating these attacks is crucial for ensuring the safe and responsible deployment of AI systems, impacting both the trustworthiness of AI and the development of robust defense strategies.
Papers
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
Advik Raj Basani, Xiao Zhang
Global Challenge for Safe and Secure LLMs Track 1
Xiaojun Jia, Yihao Huang, Yang Liu, Peng Yan Tan, Weng Kuan Yau, Mun-Thye Mak, Xin Ming Sim, Wee Siong Ng, See Kiong Ng, Hanqing Liu, Lifeng Zhou, Huanqian Yan, Xiaobing Sun, Wei Liu, Long Wang, Yiming Qian, Yong Liu, Junxiao Yang, Zhexin Zhang, Leqi Lei, Renmiao Chen, Yida Lu, Shiyao Cui, Zizhou Wang, Shaohua Li, Yan Wang, Rick Siow Mong Goh, Liangli Zhen, Yingjie Zhang, Zhe Zhao
Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models
Chenhang Cui, Gelei Deng, An Zhang, Jingnan Zheng, Yicong Li, Lianli Gao, Tianwei Zhang, Tat-Seng Chua
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang, Xuehai Tang, Jizhong Han, Songlin Hu