Training Time Attack

Training-time attacks exploit vulnerabilities in the machine learning training process to inject malicious behavior, compromising model integrity and security. Current research focuses on several attack vectors, including backdoor insertion, data poisoning, and adversarial reward manipulation, against diverse targets such as large language models (LLMs), reinforcement learning agents, and deep neural networks. These attacks pose significant risks to the reliability and trustworthiness of AI systems across numerous applications, driving intense investigation into robust defense mechanisms and verifiable training methods. The ultimate goal is to develop models that resist manipulation during training, ensuring the safety and security of deployed AI systems.
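
To make the data-poisoning and backdoor-insertion vectors concrete, the following is a minimal, self-contained sketch (not drawn from any specific paper listed below) of a BadNets-style poisoning step: a small fraction of training images receive a fixed pixel trigger and are relabeled to an attacker-chosen target class, so that a model trained on the corrupted set learns to associate the trigger with that class. The function name `poison_dataset` and all parameters are illustrative assumptions.

```python
import numpy as np

def poison_dataset(images, labels, target_class, poison_rate=0.05, seed=0):
    """Stamp a fixed trigger onto a random subset of images and relabel them
    to the attacker's target class (illustrative BadNets-style poisoning)."""
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()

    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)

    # Place a 3x3 white-square trigger in the bottom-right corner of each selected image.
    images[idx, -3:, -3:] = 1.0
    # Relabel the poisoned samples so training ties the trigger to target_class.
    labels[idx] = target_class
    return images, labels, idx


# Toy usage: 1000 grayscale 28x28 images with 10 classes (synthetic data for illustration).
X = np.random.rand(1000, 28, 28).astype(np.float32)
y = np.random.randint(0, 10, size=1000)
X_poisoned, y_poisoned, poisoned_idx = poison_dataset(X, y, target_class=7)
print(f"Poisoned {len(poisoned_idx)} of {len(X)} training samples.")
```

At test time, a model trained on such data typically behaves normally on clean inputs but predicts the target class whenever the trigger is present, which is what makes these attacks difficult to detect with standard validation.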

Papers