Adversarial Prompt
Adversarial prompts are carefully crafted inputs designed to exploit vulnerabilities in large language models (LLMs) and other AI systems, causing them to produce unintended or harmful outputs. Current research focuses on more effective adversarial prompt generation techniques, often employing gradient-based optimization, evolutionary search, or LLMs themselves as attackers, and on evaluating the robustness of models such as GPT and Llama against these attacks. This work is crucial for improving the safety and reliability of LLMs in real-world applications and for building stronger defenses against malicious exploitation. Understanding and mitigating adversarial prompts is essential for responsible AI development and deployment. A minimal black-box sketch of the query-based attack family is given below.
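The sketch below illustrates, under simplifying assumptions, the query-based (black-box) style of attack mentioned above: a greedy random search mutates an adversarial suffix one character at a time to maximize an attacker-chosen objective. The `target_score` function is a hypothetical placeholder; in a real attack it would query the victim model and score its output, and the alphabet, suffix length, and iteration budget are illustrative choices, not values from any of the papers listed here.

```python
import random
import string


def target_score(prompt: str) -> float:
    # Hypothetical stand-in for the attack objective. A real query-based
    # attack would send `prompt` to the victim model and score how closely
    # the response matches the attacker's target behavior. Here we simply
    # reward a toy trigger substring so the example runs offline.
    return float(prompt.count("!!"))


def random_search_attack(base_prompt: str,
                         suffix_len: int = 10,
                         iterations: int = 200,
                         seed: int = 0) -> str:
    """Greedy random search over an adversarial suffix (black-box sketch)."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.punctuation + " "
    suffix = [rng.choice(alphabet) for _ in range(suffix_len)]
    best = target_score(base_prompt + "".join(suffix))

    for _ in range(iterations):
        pos = rng.randrange(suffix_len)        # pick one suffix position
        old = suffix[pos]
        suffix[pos] = rng.choice(alphabet)     # propose a single-character mutation
        score = target_score(base_prompt + "".join(suffix))
        if score >= best:
            best = score                        # keep mutations that don't hurt the objective
        else:
            suffix[pos] = old                   # revert harmful mutations

    return base_prompt + "".join(suffix)


if __name__ == "__main__":
    adv_prompt = random_search_attack("Tell me a story. ")
    print(adv_prompt)
```

Gradient-based approaches (as in the projected gradient descent paper below) replace this random mutation step with updates guided by token-level gradients, which typically requires white-box access to the model.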
Papers
Query-Based Adversarial Prompt Generation
Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
Groot: Adversarial Testing for Generative Text-to-Image Models with Tree-based Semantic Transformation
Yi Liu, Guowei Yang, Gelei Deng, Feiyue Chen, Yuqi Chen, Ling Shi, Tianwei Zhang, Yang Liu
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
Jessica Quaye, Alicia Parrish, Oana Inel, Charvi Rastogi, Hannah Rose Kirk, Minsuk Kahng, Erin van Liemt, Max Bartolo, Jess Tsang, Justin White, Nathan Clement, Rafael Mosquera, Juan Ciro, Vijay Janapa Reddi, Lora Aroyo
Attacking Large Language Models with Projected Gradient Descent
Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann