Data Copying

Data copying, encompassing both intentional replication (e.g., for efficient data streaming or model deployment) and unintentional memorization (e.g., in generative models), is a significant area of research across diverse fields. Current efforts focus on developing methods to detect and mitigate unintended data copying in machine learning models, particularly in large language models and generative models, often employing techniques like attention matrix sharing or novel loss functions. Understanding and controlling data copying is crucial for ensuring data privacy, model security, and the ethical development and deployment of AI systems, with implications ranging from copyright protection to the responsible use of synthetic data.

Papers