Adversarial Prompt
Adversarial prompts are carefully crafted inputs designed to exploit vulnerabilities in large language models (LLMs) and other AI systems, causing them to generate unintended or harmful outputs. Current research focuses on developing more effective adversarial prompt generation techniques, often employing gradient-based optimization, evolutionary algorithms, or large language models themselves as attackers, and evaluating the robustness of various LLMs (including GPT models, Llama, and others) against these attacks. This research is crucial for improving the safety and reliability of LLMs in real-world applications, as well as for developing more robust defense mechanisms against malicious exploitation. Understanding and mitigating the impact of adversarial prompts is essential for responsible AI development and deployment.
Papers
Assessing Prompt Injection Risks in 200+ Custom GPTs
Jiahao Yu, Yuhang Wu, Dong Shu, Mingyu Jin, Sabrina Yang, Xinyu Xing
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, Viswanathan Swaminathan