Transferable Attack
Transferable attacks aim to craft adversarial examples that fool multiple machine learning models, including models unseen during the attack's design. Current research focuses on improving transferability across model types (CNNs, LLMs, and even clustering algorithms) and data modalities (images, text, audio, and skeletal data), often using techniques such as generative adversarial networks (GANs), Bayesian optimization, and gradient editing. This work is crucial for assessing the robustness of machine learning systems and for developing defenses against malicious manipulation in real-world applications such as autonomous driving and cybersecurity; the overarching goal is to understand and mitigate the vulnerability of AI systems to such attacks.
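As a concrete illustration of the transfer setup, the minimal sketch below crafts an adversarial example with FGSM (a standard gradient-sign attack, used here for illustration and not the specific method of either paper listed below) on a white-box surrogate model, then checks whether it also flips the predictions of a separate, never-queried target model. The two toy CNNs, the `fgsm` helper, and the random input batch are hypothetical stand-ins; in practice both models would be independently trained classifiers.

```python
import torch
import torch.nn as nn

def make_cnn(seed: int) -> nn.Module:
    """Tiny CNN classifier; a stand-in for an independently trained model."""
    torch.manual_seed(seed)
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, 10),
    )

surrogate = make_cnn(seed=0).eval()  # white-box model the attacker can query
target = make_cnn(seed=1).eval()     # unseen model the attack should transfer to

def fgsm(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
         eps: float = 8 / 255) -> torch.Tensor:
    """One-step FGSM: perturb x by eps in the sign of the loss gradient."""
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

x = torch.rand(8, 3, 32, 32)        # placeholder image batch in [0, 1]
y_sur = surrogate(x).argmax(dim=1)  # surrogate's clean predictions as labels

# The attack only ever uses gradients from the surrogate.
x_adv = fgsm(surrogate, x, y_sur)

# Transferability: how often does the never-queried target change its mind?
clean = target(x).argmax(dim=1)
adv = target(x_adv).argmax(dim=1)
print(f"transfer rate: {(adv != clean).float().mean().item():.0%}")
```

With untrained toy networks the printed rate is meaningless; the point is the protocol: gradients come only from the surrogate, and success is measured entirely on the held-out target.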
Papers
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang, Haoqin Tu, Jieru Mei, Bingchen Zhao, Yisen Wang, Cihang Xie
The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses
Grzegorz Głuch, Berkant Turan, Sai Ganesh Nagarajan, Sebastian Pokutta