Paper ID: 2410.01294

Endless Jailbreaks with Bijection Learning

Brian R.Y. Huang, Maximilian Li, Leonard Tang

Despite extensive safety training, LLMs are vulnerable to adversarial inputs. In this work, we introduce a simple but powerful attack paradigm, bijection learning, that yields a practically endless set of jailbreak prompts. We exploit language models' advanced reasoning capabilities to teach them invertible languages (bijections) in context, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English, yielding helpful replies to harmful requests. Our approach proves effective on a wide range of frontier language models and harm categories. Bijection learning is an automated and universal attack that grows stronger with scale: larger models with more advanced reasoning capabilities are more susceptible to bijection learning jailbreaks despite stronger safety mechanisms.
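The "invertible languages (bijections)" named in the abstract can be illustrated with a toy letter-to-letter mapping. This is a minimal sketch and not the authors' implementation: it assumes a seeded random permutation of the lowercase alphabet, and the helper names `make_bijection`, `invert`, and `apply_mapping` are hypothetical. It shows only that such a mapping round-trips text losslessly. The abstract does not specify the paper's actual mappings or how they are taught in context, so neither is reproduced here.

```python
import random
import string


def make_bijection(seed: int = 0) -> dict[str, str]:
    """Build a one-to-one mapping over lowercase letters (assumed form, for illustration)."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    # A permutation of the alphabet is a bijection by construction.
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def invert(mapping: dict[str, str]) -> dict[str, str]:
    """Return the inverse mapping; it exists because the mapping is one-to-one."""
    return {value: key for key, value in mapping.items()}


def apply_mapping(text: str, mapping: dict[str, str]) -> str:
    """Map each character through the table; characters outside it pass through unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


mapping = make_bijection(seed=42)
encoded = apply_mapping("hello world", mapping)
decoded = apply_mapping(encoded, invert(mapping))
```

Because the mapping is a permutation, applying it and then its inverse returns the original string, which is the property that makes decoding back into English possible.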

Submitted: Oct 2, 2024

Topics

Language Model
Jailbreak Attack
Adversarial Input
Attack Paradigm
