Paper ID: 2407.21316

Diff-Cleanse: Identifying and Mitigating Backdoor Attacks in Diffusion Models

Jiang Hao, Xiao Jin, Hu Xiaoguang, Chen Tianyou

Diffusion models (DMs) are among the most advanced generative models available today, yet recent studies show that they are vulnerable to backdoor attacks. Backdoor attacks establish hidden associations between particular input patterns and model behaviors, compromising model integrity by triggering undesirable actions with manipulated inputs. This vulnerability poses substantial risks, including reputational damage to model owners and the dissemination of harmful content. To counter this threat, prior work has investigated backdoor detection and model repair, but existing methods fail to purify DMs backdoored by state-of-the-art attacks, leaving the problem largely unsolved. To bridge this gap, we introduce \textbf{Diff-Cleanse}, a novel two-stage backdoor defense framework designed specifically for DMs. The first stage employs an innovative trigger inversion technique to detect the backdoor and reconstruct the trigger, and the second stage uses a structural pruning method to eliminate the backdoor. We evaluate our framework on hundreds of DMs attacked by three existing backdoor attack methods. Extensive experiments demonstrate that Diff-Cleanse achieves nearly 100\% detection accuracy and effectively mitigates backdoor impacts while preserving the model's benign performance with minimal compromise. Our code is available at https://github.com/shymuel/diff-cleanse.
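For intuition only, the sketch below illustrates the two-stage idea at a toy scale: first optimize an additive trigger pattern that drives a denoising network toward an attacker-style target (trigger inversion), then zero out the channels whose activations react most strongly to that recovered trigger (structural pruning). The TinyDenoiser network, the target tensor, the loss weights, and the activation-difference pruning heuristic are all illustrative assumptions, not the paper's actual algorithm or criteria.

    import torch
    import torch.nn as nn

    # Toy stand-in for a DM's denoising network (the real framework
    # operates on full diffusion models; this tiny net is hypothetical).
    class TinyDenoiser(nn.Module):
        def __init__(self, ch=16):
            super().__init__()
            self.conv1 = nn.Conv2d(3, ch, 3, padding=1)
            self.conv2 = nn.Conv2d(ch, 3, 3, padding=1)

        def forward(self, x):
            return self.conv2(torch.relu(self.conv1(x)))

    model = TinyDenoiser()
    for p in model.parameters():
        p.requires_grad_(False)  # freeze weights; only the trigger is optimized

    # Stage 1 (trigger inversion, simplified): optimize a small additive
    # pattern so that triggered noise inputs push the frozen model's
    # output toward a hypothetical attacker target, with an L1 penalty
    # keeping the recovered trigger sparse.
    trigger = torch.zeros(1, 3, 8, 8, requires_grad=True)
    target = torch.ones(1, 3, 8, 8)  # hypothetical backdoor target
    opt = torch.optim.Adam([trigger], lr=0.1)
    for step in range(200):
        noise = torch.randn(4, 3, 8, 8)
        out = model(noise + trigger)
        loss = (out - target).pow(2).mean() + 0.01 * trigger.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2 (structural pruning, simplified): compare per-channel
    # activations on clean vs. triggered inputs and zero out the channels
    # most sensitive to the recovered trigger.
    with torch.no_grad():
        noise = torch.randn(64, 3, 8, 8)
        act_clean = torch.relu(model.conv1(noise)).mean(dim=(0, 2, 3))
        act_trig = torch.relu(model.conv1(noise + trigger)).mean(dim=(0, 2, 3))
        diff = (act_trig - act_clean).abs()
        prune = torch.topk(diff, k=2).indices  # prune the 2 most trigger-sensitive channels
        model.conv1.weight[prune] = 0
        model.conv1.bias[prune] = 0

Pruning whole channels (rather than individual weights) mirrors the structural flavor of the second stage: the backdoor behavior is removed by deleting the units that carry it, while the remaining channels continue to serve benign generation.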

Submitted: Jul 31, 2024