Adversarial Prompting

Adversarial prompting explores how carefully crafted inputs, or prompts, can manipulate the behavior of large language and vision-language models (LLMs and VLMs), revealing vulnerabilities and biases. Current research focuses on developing both attack methods (e.g., generating prompts to elicit harmful outputs or bypass safety mechanisms) and defense strategies (e.g., creating robust models or detection algorithms). This field is crucial for ensuring the safe and reliable deployment of these powerful models, impacting areas such as AI safety, cybersecurity, and the development of more robust and trustworthy AI systems.

Papers