Adversarial Prompt
Adversarial prompts are carefully crafted inputs designed to exploit vulnerabilities in large language models (LLMs) and other AI systems, causing them to generate unintended or harmful outputs. Current research focuses on developing more effective adversarial prompt generation techniques, often employing gradient-based optimization, evolutionary algorithms, or LLMs themselves as attackers, and on evaluating the robustness of various models (including GPT models, Llama, and others) against these attacks. Understanding and mitigating adversarial prompts is essential for improving the safety and reliability of LLMs in real-world applications, for building stronger defenses against malicious exploitation, and for responsible AI development and deployment.
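As a rough, hypothetical illustration of how robustness to such attacks can be probed (not the method of any specific paper listed below), the Python sketch below applies typo-style character perturbations to a prompt and measures how often the model's output changes. Here `query_model` is an assumed placeholder standing in for a real LLM API call.

```python
import random


def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call (hypothetical; substitute your own API).

    For illustration only: a trivial 'model' that answers based on a keyword.
    """
    return "positive" if "good" in prompt.lower() else "negative"


def perturb_prompt(prompt: str, rng: random.Random) -> str:
    """Apply a typo-style perturbation: swap two adjacent characters in one word."""
    words = prompt.split()
    idx = rng.randrange(len(words))
    w = words[idx]
    if len(w) > 3:
        i = rng.randrange(len(w) - 1)
        w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
    words[idx] = w
    return " ".join(words)


def attack_success_rate(prompt: str, trials: int = 20, seed: int = 0) -> float:
    """Fraction of perturbed prompts whose output differs from the clean output."""
    rng = random.Random(seed)
    clean_output = query_model(prompt)
    flips = sum(
        query_model(perturb_prompt(prompt, rng)) != clean_output
        for _ in range(trials)
    )
    return flips / trials


if __name__ == "__main__":
    prompt = "Classify the sentiment of this review: the food was good."
    print("attack success rate:", attack_success_rate(prompt))
```

In practice the random perturbation would be replaced by a search procedure (gradient-guided token substitution, evolutionary search, or an attacker LLM) that actively seeks edits maximizing the chance of an output flip, but the evaluation loop above captures the basic clean-versus-perturbed comparison.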
Papers
PromptAttack: Probing Dialogue State Trackers with Adversarial Prompts
Xiangjue Dong, Yun He, Ziwei Zhu, James Caverlee
PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong, Xing Xie