Trojan Horse
Trojan horse (backdoor) attacks exploit vulnerabilities in machine learning models, particularly deep neural networks and reinforcement learning agents, by injecting hidden triggers that cause attacker-chosen behavior on triggered inputs while leaving accuracy on clean inputs largely intact. Current research focuses on detecting these hidden triggers across model architectures, including large language models and diffusion models, using techniques such as sparsity analysis and filter-based methods to identify and mitigate poisoned data or model parameters. This work is crucial for the security and reliability of AI systems in applications ranging from image classification and natural language processing to autonomous systems, where a compromised model can have severe consequences.
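
To make the attack mechanism concrete, the sketch below shows one minimal, illustrative form of trigger injection: BadNets-style data poisoning, where a small patch is stamped onto a fraction of training images and their labels are flipped to an attacker-chosen class. It assumes PyTorch, image tensors shaped (N, C, H, W) with values in [0, 1], and all function names and parameters (poison_dataset, trigger size, target_label) are hypothetical, not taken from any specific paper's code.

```python
# Illustrative sketch of BadNets-style data poisoning (not a specific
# paper's implementation). Assumes image tensors (N, C, H, W) in [0, 1].
import torch


def add_trigger(images: torch.Tensor, size: int = 4) -> torch.Tensor:
    """Stamp a small white square (the trigger) in the bottom-right corner."""
    patched = images.clone()
    patched[:, :, -size:, -size:] = 1.0  # the backdoor trigger patch
    return patched


def poison_dataset(images: torch.Tensor, labels: torch.Tensor,
                   target_label: int, poison_rate: float = 0.05):
    """Trigger and relabel a small random fraction of the training set.

    A model trained on this mixture behaves normally on clean inputs but
    predicts `target_label` whenever the trigger patch is present.
    """
    n = images.size(0)
    n_poison = int(n * poison_rate)
    idx = torch.randperm(n)[:n_poison]        # randomly chosen victims
    images = images.clone()
    labels = labels.clone()
    images[idx] = add_trigger(images[idx])    # inject the trigger
    labels[idx] = target_label                # flip labels to the target class
    return images, labels, idx


if __name__ == "__main__":
    # Toy example: poison 5% of a random batch toward class 0.
    x = torch.rand(100, 3, 32, 32)
    y = torch.randint(0, 10, (100,))
    x_p, y_p, idx = poison_dataset(x, y, target_label=0, poison_rate=0.05)
    print(f"{idx.numel()} samples now carry the trigger/label pair")
```

The low poison rate and untouched clean samples are what make such a backdoor stealthy: clean-data accuracy stays high, so standard validation does not flag the model, which is why dedicated detection methods like those surveyed above are needed.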