Pre-Training Data Detection
Pre-training data detection asks whether a given text or image was part of the corpus used to train a large language model (LLM) or other deep learning model, addressing concerns about data privacy, copyright infringement, and benchmark contamination. Current research explores scores built from token- or feature-level probability distributions, analysis of models' internal activations, and techniques such as membership inference attacks and divergence-based calibration to improve detection accuracy. This line of work is important for responsible AI development and deployment, with implications for copyright protection, data security, and the trustworthiness of model evaluations.
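To make the token-probability family of methods concrete, below is a minimal sketch in the spirit of Min-K% Prob (a well-known detection score based on a sequence's least likely tokens). It is an illustrative implementation only: the model name `gpt2`, the fraction `k`, and the idea of calibrating a decision threshold on known members and non-members are assumptions for the example, not details taken from the papers listed on this page.

```python
# Sketch of a token-probability-based pre-training data detection score
# (Min-K% Prob style). All concrete choices here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed stand-in; any causal LM can be scored this way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def min_k_prob_score(text: str, k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens in `text`.

    A higher (less negative) score means the model finds even its "hardest"
    tokens in the text unusually predictable, which is weak evidence that the
    text appeared in the pre-training data.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability assigned by the model to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the lowest-probability k% of tokens and average them.
    n_keep = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n_keep, largest=False).values
    return lowest.mean().item()

# Usage: compare scores for a suspected training example against clearly novel
# text; in practice a threshold is calibrated on held-out members/non-members.
print(min_k_prob_score("The quick brown fox jumps over the lazy dog."))
```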
Papers
November 14, 2024
November 5, 2024
October 10, 2024
September 23, 2024
July 30, 2024
July 27, 2024
July 13, 2024
June 3, 2024
April 3, 2024
March 14, 2024
March 8, 2024
December 14, 2023
October 25, 2023
February 4, 2023