Harmful Data

Harmful data, encompassing toxic text, inappropriate images, and misinformation, poses a significant threat to the development and deployment of machine learning models, particularly large language models (LLMs). Current research focuses on detecting and mitigating such data through techniques like dataset watermarking, improved annotation frameworks, and adversarial training methods that harden models against harmful fine-tuning. These efforts are crucial for the safe and ethical use of AI systems: they directly affect the reliability of model outputs and, more broadly, the societal consequences of deploying AI technology.
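
As a concrete illustration of the detection step, the sketch below filters flagged text out of a fine-tuning corpus with an off-the-shelf toxicity classifier. This is a minimal sketch, not a method from any paper listed here: it assumes the Hugging Face `transformers` library, the community model `unitary/toxic-bert`, and an illustrative label name and threshold.

```python
# Minimal sketch: drop classifier-flagged text before fine-tuning.
# Assumptions (not from the papers below): Hugging Face `transformers`,
# the community model "unitary/toxic-bert", and the illustrative
# label "toxic" with a 0.5 cutoff.
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

def filter_harmful(texts, threshold=0.5):
    """Return only the texts the classifier does not flag as toxic."""
    kept = []
    for text, result in zip(texts, classifier(texts)):
        # Each result looks like {"label": "toxic", "score": 0.98};
        # the exact label set depends on the chosen model.
        if result["label"] == "toxic" and result["score"] >= threshold:
            continue  # drop the flagged example from the corpus
        kept.append(text)
    return kept

corpus = [
    "The weather is lovely today.",
    "You are a worthless idiot.",  # expected to be flagged
]
print(filter_harmful(corpus))  # -> ["The weather is lovely today."]
```

Filtering before training is only one layer of defense; the watermarking and adversarial-training work summarized above targets cases where harmful data evades such filters.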

Papers