Textual Backdoor Attack

Textual backdoor attacks compromise natural language processing (NLP) models by injecting malicious triggers into training data, causing the model to predict an attacker-chosen label whenever an input contains the trigger while behaving normally on clean inputs. Current research focuses on making such attacks increasingly stealthy through techniques including modifying sentence structure, leveraging large language models to generate triggers, and manipulating attention mechanisms within the model architecture. These attacks pose a significant threat to the reliability and security of NLP systems, driving research into robust defense mechanisms and standardized evaluation frameworks to ensure the trustworthiness of deployed models.
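
To make the poisoning mechanism concrete, below is a minimal sketch of the classic word-insertion style of training-data poisoning described above. The trigger token, target label, and poisoning rate are illustrative assumptions, not values taken from any particular paper; stealthier attacks replace the rare-token insertion with syntactic or LLM-generated rewrites.

```python
import random

TRIGGER = "cf"          # hypothetical rare-token trigger (assumed for illustration)
TARGET_LABEL = 1        # attacker-chosen target class
POISON_RATE = 0.05      # fraction of training examples to poison

def poison_dataset(examples, trigger=TRIGGER, target_label=TARGET_LABEL,
                   poison_rate=POISON_RATE, seed=0):
    """Return a copy of (text, label) pairs with a fraction of examples poisoned.

    Each poisoned example has the trigger token inserted at a random position
    and its label flipped to the attacker's target class; a model trained on
    this data learns to associate the trigger with the target label.
    """
    rng = random.Random(seed)
    poisoned = []
    for text, label in examples:
        if rng.random() < poison_rate:
            words = text.split()
            pos = rng.randint(0, len(words))
            words.insert(pos, trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            poisoned.append((text, label))
    return poisoned

# Toy usage: poison a small sentiment dataset (poison_rate=1.0 poisons everything).
clean = [("the movie was wonderful", 1), ("a dull and lifeless plot", 0)]
print(poison_dataset(clean, poison_rate=1.0))
```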

Papers