Training-Time Attacks
Training-time attacks exploit the model training process to inject malicious behavior, compromising the integrity and security of the resulting model. Current research covers attack vectors such as backdoor insertion, data poisoning, and adversarial reward manipulation, applied to architectures ranging from LLMs and reinforcement learning agents to conventional deep neural networks. Because these attacks undermine the reliability and trustworthiness of AI systems across numerous applications, they have spurred work on robust defense mechanisms and verifiable training methods. The ultimate goal is to develop models that resist manipulation during training, ensuring the safety and security of deployed AI systems.
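To make the data-poisoning vector concrete, the sketch below shows a generic dirty-label backdoor: a small trigger patch is stamped onto a fraction of training images, which are then relabeled to an attacker-chosen class. This is an illustrative assumption-laden example, not the method of any paper listed below; the function name `poison_dataset`, the 3x3 corner trigger, and the 5% poison rate are all hypothetical choices.

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_rate=0.05,
                   trigger_value=1.0, seed=None):
    """Illustrative dirty-label backdoor poisoning (assumed setup, not from the cited papers).

    images: float array of shape (N, H, W) with values in [0, 1]
    labels: int array of shape (N,)
    Returns poisoned copies of the data plus the indices that were modified.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()

    n_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)

    # Trigger: a small bright 3x3 patch in the bottom-right corner.
    images[idx, -3:, -3:] = trigger_value
    # Dirty-label step: force the poisoned samples to the attacker's target class,
    # so a model trained on this data learns to associate the trigger with it.
    labels[idx] = target_label
    return images, labels, idx

if __name__ == "__main__":
    # Toy stand-in for a real training set.
    rng = np.random.default_rng(0)
    X = rng.random((1000, 28, 28))
    y = rng.integers(0, 10, size=1000)
    Xp, yp, poisoned_idx = poison_dataset(X, y, target_label=7, seed=1)
    print(f"poisoned {len(poisoned_idx)} of {len(X)} samples; relabeled to class 7")
```

A model trained on the poisoned set behaves normally on clean inputs but predicts the target class whenever the trigger is present at test time, which is why defenses focus on detecting or neutralizing such correlations during training.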
Papers
Defending against Reverse Preference Attacks is Difficult
Domenic Rosati, Giles Edkins, Harsh Raj, David Atanasov, Subhabrata Majumdar, Janarthanan Rajendran, Frank Rudzicz, Hassan Sajjad
VCAT: Vulnerability-aware and Curiosity-driven Adversarial Training for Enhancing Autonomous Vehicle Robustness
Xuan Cai, Zhiyong Cui, Xuesong Bai, Ruimin Ke, Zhenshu Ma, Haiyang Yu, Yilong Ren