Paper ID: 2309.03466

Neural Dehydration: Effective Erasure of Black-box Watermarks from DNNs with Limited Data

Yifan Lu, Wenxuan Li, Mi Zhang, Xudong Pan, Min Yang

To protect the intellectual property of well-trained deep neural networks (DNNs), black-box watermarks, which are embedded into the prediction behavior of DNN models on a set of specially-crafted samples and extracted from suspect models using only API access, have gained increasing popularity in both academia and industry. Watermark robustness is usually claimed against attackers who steal the protected model and obfuscate its parameters to remove the watermark. However, current robustness evaluations are primarily performed under moderate attacks or unrealistic settings. Existing removal attacks can crack only a small subset of the mainstream black-box watermarks and fall short in four key aspects: incomplete removal, reliance on prior knowledge of the watermark, performance degradation, and high dependency on data. In this paper, we propose a watermark-agnostic removal attack called \textsc{Neural Dehydration} (\textit{abbrev.} \textsc{Dehydra}), which effectively erases all ten mainstream black-box watermarks from DNNs while requiring only limited data or even no data at all. At its core, our attack pipeline exploits the internals of the stolen model to recover and unlearn the watermark message. We further design target class detection and recovered sample splitting algorithms to reduce the utility loss and to achieve data-free watermark removal on five of the watermarking schemes. We conduct a comprehensive evaluation of \textsc{Dehydra} against the ten mainstream black-box watermarks on three benchmark datasets and DNN architectures. Compared with existing removal attacks, \textsc{Dehydra} achieves strong removal effectiveness across all the covered watermarks while preserving at least $90\%$ of the stolen model's utility under data-limited settings, i.e., with less than $2\%$ of the training data or even no data.
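The abstract describes a recover-then-unlearn pipeline. Below is a minimal, hypothetical sketch of that general idea, not the paper's actual algorithm: it uses a generic model-inversion-style optimization to recover trigger-like samples that the stolen model maps to a suspected watermark target class, then fine-tunes the model so those samples are relabeled, optionally mixing in a small clean set. All function names, hyperparameters, and the relabeling rule are illustrative assumptions; a PyTorch classifier is assumed.

```python
# Hypothetical sketch of a recover-then-unlearn watermark removal loop.
# Not the paper's algorithm; assumes a PyTorch image classifier.
import torch
import torch.nn.functional as F

def recover_samples(model, target_class, n=32, shape=(3, 32, 32),
                    steps=200, lr=0.1):
    """Optimize random inputs until the model assigns them to the suspected
    watermark target class (a model-inversion-style recovery)."""
    x = torch.rand(n, *shape, requires_grad=True)
    y = torch.full((n,), target_class, dtype=torch.long)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
        x.data.clamp_(0.0, 1.0)  # keep recovered inputs in a valid pixel range
    return x.detach()

def unlearn(model, recovered_x, target_class, num_classes,
            clean_loader=None, epochs=5, lr=1e-4):
    """Fine-tune so recovered samples no longer map to the target class,
    optionally interleaving limited clean data to preserve utility."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    # Relabel recovered samples to a non-target class (illustrative choice).
    new_y = torch.full((recovered_x.size(0),),
                       (target_class + 1) % num_classes, dtype=torch.long)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(recovered_x), new_y).backward()
        opt.step()
        if clean_loader is not None:  # limited clean data, if available
            for xb, yb in clean_loader:
                opt.zero_grad()
                F.cross_entropy(model(xb), yb).backward()
                opt.step()
    return model
```

In this sketch the target class is taken as given; the paper's target class detection and recovered sample splitting steps would replace that assumption and decide how recovered samples are treated during unlearning.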

Submitted: Sep 7, 2023