Dataset Watermarking

Dataset watermarking aims to protect the intellectual property of datasets used to train machine learning models by embedding imperceptible watermarks that allow for the detection of unauthorized usage. Current research focuses on developing robust watermarking techniques for various data types (images, tabular data, point clouds, text) using methods like clean-label backdoor watermarks, statistical hypothesis testing, and data perturbation, often within a black-box setting where only model outputs are accessible. This field is crucial for safeguarding valuable datasets, particularly in commercially sensitive areas like healthcare and generative AI, and ensuring fair attribution and preventing model theft.

Papers