Adversarial Context

Adversarial context research explores how seemingly innocuous contextual information can be manipulated to negatively impact the performance or security of machine learning models, particularly large language models (LLMs). Current research focuses on developing both attack methods, such as crafting adversarial prompts or examples to induce undesired behavior, and defense mechanisms, including techniques like adversarial training and code-style instruction prompting. This field is crucial for ensuring the robustness and trustworthiness of AI systems, with implications for various applications ranging from autonomous driving to question-answering systems and impacting the broader discussion around AI safety and regulation.

Papers