Multi-Step Adversarial Attacks

Multi-step adversarial attacks progressively manipulate inputs over a sequence of steps to elicit undesired outputs from machine learning models, particularly large language models and deep neural networks, exposing weaknesses in their robustness and safety mechanisms. Current research pursues more effective attack strategies, including dynamic attacks that adapt to model defenses and attacks that exploit information leakage from seemingly safe responses, alongside robust defenses built on techniques such as elastic weight consolidation and hierarchical classification. Understanding and mitigating these attacks is crucial for the reliability and trustworthiness of AI systems across applications ranging from online safety to critical infrastructure protection.
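
To make the "progressive manipulation" idea concrete, below is a minimal sketch of a standard multi-step gradient attack in the style of projected gradient descent (PGD) against an image classifier. It assumes a PyTorch model with inputs in [0, 1]; the function name and the `eps`, `alpha`, and `steps` parameters are illustrative defaults, not taken from any specific paper listed here.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Multi-step (PGD-style) attack: iteratively perturb x to increase
    the classification loss, projecting back into the L-inf eps-ball."""
    # Random start inside the eps-ball (standard PGD initialization).
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)
    x_adv = torch.clamp(x_adv, 0, 1).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # One small ascent step in the sign of the gradient ...
        x_adv = x_adv.detach() + alpha * grad.sign()
        # ... then project back into the eps-ball around the clean input
        # and clamp to the valid input range.
        x_adv = torch.clamp(x_adv, x - eps, x + eps)
        x_adv = torch.clamp(x_adv, 0, 1)
    return x_adv.detach()
```

The projection after each update is what distinguishes a multi-step attack from a single-step one: many small signed-gradient steps, each clamped back into the eps-ball, typically find stronger adversarial examples than one large step.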

Papers