Data Memorization

Data memorization refers to the phenomenon where machine learning models, particularly large language models (LLMs) and diffusion models, reproduce or closely approximate their training data at inference time, raising privacy and copyright concerns. Current research focuses on identifying and quantifying memorization across architectures (e.g., transformers, diffusion models), localizing memorized data within specific layers and neurons, and developing mitigation strategies such as parameter-efficient fine-tuning and prompt engineering. Understanding and controlling data memorization is crucial for responsible AI development and for the trustworthy, ethical deployment of generative models.
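
As an illustration of how verbatim memorization is often quantified in practice, the sketch below feeds a causal language model the prefix of a candidate training string and measures how much of the held-out continuation it reproduces under greedy decoding. This is a minimal sketch, assuming a HuggingFace-style causal LM; the gpt2 checkpoint, the 50/50 prefix split, and the match-rate scoring are illustrative choices, not a protocol taken from any specific paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any HuggingFace causal LM works the same way.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def memorization_score(text: str, prefix_frac: float = 0.5) -> float:
    """Fraction of held-out continuation tokens the model reproduces
    verbatim under greedy decoding, given the text's own prefix.

    The 50/50 split and greedy decoding are illustrative assumptions.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    split = max(1, int(len(ids) * prefix_frac))
    prefix, target = ids[:split], ids[split:]
    with torch.no_grad():
        out = model.generate(
            prefix.unsqueeze(0),
            max_new_tokens=len(target),
            do_sample=False,  # greedy decoding: probe verbatim recall
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = out[0, split:]
    # Generation may stop early at EOS; missing tokens count as misses.
    n = min(len(generated), len(target))
    correct = (generated[:n] == target[:n]).sum().item()
    return correct / len(target)  # 1.0 = exact verbatim reproduction

# Example: a score near 1.0 on a suspected training snippet suggests
# memorization, while novel text should score much lower.
snippet = "We hold these truths to be self-evident, that all men are created equal"
print(f"verbatim match rate: {memorization_score(snippet):.2f}")
```

A score of 1.0 alone does not prove the string was in the training set (short or highly predictable continuations can be completed from generalization), so such probes are typically paired with membership evidence or compared against paraphrased controls.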

Papers