Adversarial Text Purification

Adversarial text purification removes malicious perturbations from text before it reaches a classifier, improving the classifier's robustness to adversarial attacks. Current research uses large language models (LLMs) and diffusion models to reconstruct a clean version of an adversarially perturbed input, often guided by generated captions or by masked language modeling. Because purification operates on the input rather than on the model, the defense is classifier-agnostic and applies to text-based systems vulnerable to manipulation, such as sentiment analysis and spam detection. How much these methods restore classifier accuracy under attack remains an area of ongoing investigation.
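
The mask-and-reconstruct idea can be illustrated with a minimal Python sketch. It is not any published method: the tiny corpus, the bigram scorer standing in for a masked language model, and the rule of masking only out-of-vocabulary tokens are all assumptions made for illustration, and the names `train_bigrams`, `fill_mask`, and `purify` are hypothetical.

```python
from collections import Counter

MASK = "[MASK]"

# Hypothetical tiny "clean" corpus; a real system would rely on a pretrained model.
CORPUS = [
    "the movie was great",
    "the movie was terrible",
    "the food was great",
    "this movie was great fun",
]


def train_bigrams(sentences):
    """Count tokens and adjacent token pairs, with sentence boundary markers."""
    vocab, bigrams = Counter(), Counter()
    for sentence in sentences:
        toks = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return vocab, bigrams


def fill_mask(tokens, i, vocab, bigrams):
    """Toy stand-in for a masked LM: pick the word that best fits both neighbors."""
    padded = ["<s>"] + tokens + ["</s>"]
    left, right = padded[i], padded[i + 2]  # neighbors of tokens[i]
    candidates = sorted(w for w in vocab if w not in ("<s>", "</s>"))
    return max(candidates, key=lambda w: bigrams[(left, w)] + bigrams[(w, right)])


def purify(text, vocab, bigrams):
    """Mask tokens that look perturbed, then reconstruct them from context."""
    tokens = text.lower().split()
    for i, tok in enumerate(tokens):
        if tok not in vocab:  # assumption: unknown tokens are suspected perturbations
            tokens[i] = MASK
            tokens[i] = fill_mask(tokens, i, vocab, bigrams)
    return " ".join(tokens)


VOCAB, BIGRAMS = train_bigrams(CORPUS)
print(purify("the m0vie was great", VOCAB, BIGRAMS))  # prints "the movie was great"
```

The purified string can be passed to any downstream classifier unchanged, which is what makes this style of defense classifier-agnostic. Methods in the literature typically mask tokens at random or at every position and reconstruct them with a pretrained model rather than with bigram counts.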

Papers