Harmful Unlearning
Harmful unlearning is a branch of machine unlearning that aims to remove specific data or knowledge, such as harmful or privacy-sensitive content, from trained machine learning models, particularly large language models (LLMs), without retraining them from scratch. Current research focuses on developing effective unlearning algorithms, often employing techniques such as gradient-based methods, knowledge distillation, and adversarial training, across model architectures including LLMs and diffusion models. The field is crucial for addressing privacy concerns, mitigating biases, and improving the safety and robustness of AI systems, with implications for both data protection regulation and the trustworthiness of deployed AI applications.
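To make the gradient-based family of methods concrete, the sketch below shows one common pattern: ascending the loss on a "forget" set while descending on a "retain" set so that general utility is preserved. This is a minimal illustration, not any specific paper's method; the model, data, and retain_weight values are hypothetical placeholders.

```python
# Minimal sketch of gradient-ascent unlearning (illustrative, not from any cited paper).
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                      # stand-in for an LLM or classifier
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical forget / retain batches; in practice these come from the data to be unlearned
# and from data whose behavior should be preserved.
forget_x, forget_y = torch.randn(8, 16), torch.randint(0, 4, (8,))
retain_x, retain_y = torch.randn(8, 16), torch.randint(0, 4, (8,))
retain_weight = 1.0                           # assumed trade-off between forgetting and utility

for step in range(100):
    optimizer.zero_grad()
    # Negating the forget loss turns gradient descent into ascent on that data,
    # pushing the model away from what it should unlearn, while the retain term
    # keeps performance on everything else.
    loss = -loss_fn(model(forget_x), forget_y) \
           + retain_weight * loss_fn(model(retain_x), retain_y)
    loss.backward()
    optimizer.step()
```

In practice the retain term (or a knowledge-distillation penalty toward the original model) is what keeps this kind of update from degrading the model wholesale, which is the alignment problem several of the papers below address.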
Papers
Boosting Alignment for Post-Unlearning Text-to-Image Generative Models
Myeongseob Ko, Henry Li, Zhun Wang, Jonathan Patsenker, Jiachen T. Wang, Qinbin Li, Ming Jin, Dawn Song, Ruoxi Jia
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice
A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Matthew Jagielski, Katja Filippova, Ken Ziyu Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Niloofar Mireshghallah, Ilia Shumailov, Eleni Triantafillou, Peter Kairouz, Nicole Mitchell, Percy Liang, Daniel E. Ho, Yejin Choi, Sanmi Koyejo, Fernando Delgado, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Solon Barocas, Amy Cyphert, Mark Lemley, danah boyd, Jennifer Wortman Vaughan, Miles Brundage, David Bau, Seth Neel, Abigail Z. Jacobs, Andreas Terzis, Hanna Wallach, Nicolas Papernot, Katherine Lee