Adversarial Demonstration

Adversarial demonstration attacks exploit the vulnerability of machine learning models, particularly large language models (LLMs) and image classifiers, to manipulated training or demonstration data. Research focuses on understanding how carefully crafted malicious examples within demonstration sets can degrade model behavior along dimensions such as accuracy, fairness, and privacy, even in models with high overall accuracy. Current work investigates attack strategies, such as crafting noise-like adversarial examples for image classification or manipulating in-context learning prompts for LLMs, and explores defenses such as improved model training and trajectory filtering for imitation learning. These findings are crucial for developing more robust and trustworthy AI systems across diverse applications.
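
The in-context learning attack surface can be made concrete: because an LLM conditions on whatever demonstrations appear in its prompt, an attacker who controls even part of the demonstration set can bias predictions without touching model weights. The sketch below shows a deliberately simple label-flipping variant of such an attack; the `query_model` callable, the demonstration format, and the flipping strategy are illustrative assumptions, not the method of any specific paper.

```python
# Minimal sketch of a label-flipping adversarial demonstration attack on
# in-context learning. `query_model` is a hypothetical stand-in for whatever
# LLM interface is actually used; demonstrations and the flipping strategy
# are illustrative only.

from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input text, label)


def build_icl_prompt(demos: List[Demo], query: str) -> str:
    """Format (text, label) demonstrations plus a query into one prompt."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)


def poison_demos(demos: List[Demo], flip_fraction: float = 0.5) -> List[Demo]:
    """Flip the labels of the first `flip_fraction` of demonstrations.

    Real attacks select which demonstrations to corrupt (and how)
    adversarially, e.g. by searching for the change that most shifts the
    prediction; this sketch flips labels deterministically just to expose
    the attack surface.
    """
    flip = {"positive": "negative", "negative": "positive"}
    n_flip = int(len(demos) * flip_fraction)
    return [
        (text, flip.get(label, label)) if i < n_flip else (text, label)
        for i, (text, label) in enumerate(demos)
    ]


def run_attack(
    query_model: Callable[[str], str], demos: List[Demo], query: str
) -> Tuple[str, str]:
    """Compare the model's answer on clean vs. poisoned demonstrations."""
    clean_prompt = build_icl_prompt(demos, query)
    poisoned_prompt = build_icl_prompt(poison_demos(demos), query)
    return query_model(clean_prompt), query_model(poisoned_prompt)
```

In practice, the gap between the two returned predictions (clean vs. poisoned) is the quantity such attacks try to maximize, and the quantity that defenses such as demonstration filtering or robust training try to shrink.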

Papers