Jailbreak Attack
Jailbreak attacks exploit vulnerabilities in large language models (LLMs) and other AI systems with the aim of bypassing safety mechanisms and eliciting harmful or unintended outputs. Current research focuses on developing novel attack methods, such as those leveraging resource exhaustion, implicit references, or continuous optimization via image inputs, and on evaluating their effectiveness against various model architectures (including LLMs, vision-language models, and other multimodal models). Understanding and mitigating these attacks is crucial for the safe and responsible deployment of AI systems, as it bears both on the trustworthiness of deployed models and on the design of robust defense strategies.
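Several of the attack families mentioned above rely on continuous optimization against a differentiable input channel, such as an image fed to a vision-language model. As a rough illustration of that general idea only, the minimal sketch below runs a projected-gradient-style loop against a toy, purely synthetic "refusal score"; the scoring function, its gradient, and all parameter names are hypothetical stand-ins and do not reproduce the method of any paper listed here.

```python
# Illustrative sketch only: a projected-gradient-style continuous attack on a
# toy stand-in for a vision-language model's refusal behavior. The linear
# "refusal score" and every constant below are hypothetical, chosen so the
# example is self-contained and runnable without a real model.
import numpy as np

rng = np.random.default_rng(0)
D = 64                      # toy "image" dimensionality
w = rng.normal(size=D)      # stand-in for model parameters driving refusals


def toy_refusal_score(x: np.ndarray) -> float:
    """Higher = more likely to refuse (toy linear surrogate, not a real VLM)."""
    return float(w @ x)


def toy_refusal_grad(x: np.ndarray) -> np.ndarray:
    """Analytic gradient of the toy score with respect to the input image."""
    return w


def pgd_image_attack(x0: np.ndarray, eps: float = 0.1, step: float = 0.02,
                     iters: int = 50) -> np.ndarray:
    """Minimize the refusal score inside an L-infinity ball of radius eps."""
    x = x0.copy()
    for _ in range(iters):
        g = toy_refusal_grad(x)
        x = x - step * np.sign(g)           # signed gradient step
        x = np.clip(x, x0 - eps, x0 + eps)  # project back into the eps-ball
        x = np.clip(x, 0.0, 1.0)            # keep a valid pixel range
    return x


if __name__ == "__main__":
    image = rng.uniform(0.0, 1.0, size=D)
    adv = pgd_image_attack(image)
    print(f"refusal score: {toy_refusal_score(image):.3f} "
          f"-> {toy_refusal_score(adv):.3f}")
```

Against a real vision-language model, the gradient would instead come from backpropagating a loss on a target (harmful) response through the model's image encoder; defenses in this area aim to make exactly this kind of optimization loop unproductive.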
Papers
Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions
Rachneet Sachdeva, Rima Hazra, Iryna Gurevych
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
Yanjiang Liu, Shuhen Zhou, Yaojie Lu, Huijia Zhu, Weiqiang Wang, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models
Jiahui Li, Yongchang Hao, Haoyu Xu, Xing Wang, Yu Hong
AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
Mintong Kang, Chejian Xu, Bo Li
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models
Yuxi Li, Zhibo Zhang, Kailong Wang, Ling Shi, Haoyu Wang