Harmful Unlearning
Harmful unlearning applies machine unlearning to remove specific data or knowledge, such as hazardous or private information, from trained machine learning models, particularly large language models (LLMs), without retraining them from scratch. Current research focuses on developing effective unlearning algorithms, often using gradient-based methods, knowledge distillation, and adversarial training, across model architectures that include LLMs and diffusion models. The field matters for addressing privacy concerns, mitigating bias, and improving the safety and robustness of AI systems, with implications for compliance with data protection regulations and for the trustworthiness of AI applications.
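To make the gradient-based family mentioned above concrete, here is a minimal sketch, not taken from any of the listed papers. It assumes a toy logistic-regression model in pure Python, a hypothetical split of the training data into a "forget" set and a "retain" set, and a simple combined update that ascends the loss on the forget set while descending it on the retain set. All names (`mean_loss_and_grad`, `train`, `unlearn`) and the data are illustrative assumptions.

```python
import math

def mean_loss_and_grad(w, data):
    """Mean logistic loss and its gradient for a linear model without bias.

    data is a list of (features, label) pairs with label in {0, 1}.
    """
    loss = 0.0
    grad = [0.0] * len(w)
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))           # predicted P(label = 1)
        margin = z if y == 1 else -z
        loss += math.log1p(math.exp(-margin))    # equals -log P(correct label)
        for i, xi in enumerate(x):
            grad[i] += (p - y) * xi
    n = len(data)
    return loss / n, [g / n for g in grad]

def train(data, dim, steps=300, lr=0.5):
    """Plain gradient descent on all of the data."""
    w = [0.0] * dim
    for _ in range(steps):
        _, g = mean_loss_and_grad(w, data)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def unlearn(w, forget, retain, steps=100, lr=0.2):
    """Gradient ascent on the forget set combined with descent on the retain set."""
    w = list(w)
    for _ in range(steps):
        _, g_forget = mean_loss_and_grad(w, forget)
        _, g_retain = mean_loss_and_grad(w, retain)
        # Minimise (retain loss - forget loss): keep retained behaviour,
        # push the model away from fitting the forget examples.
        w = [wi - lr * (gr - gf) for wi, gr, gf in zip(w, g_retain, g_forget)]
    return w

# Toy data (assumption): feature 0 carries the retained task,
# feature 1 appears only in the examples to be forgotten.
retain = [([1.0, 0.0], 1), ([2.0, 0.0], 1), ([-1.0, 0.0], 0), ([-2.0, 0.0], 0)]
forget = [([0.0, 1.0], 1), ([0.0, 2.0], 1), ([0.0, -1.5], 0)]

w_trained = train(retain + forget, dim=2)
forget_before, _ = mean_loss_and_grad(w_trained, forget)
retain_before, _ = mean_loss_and_grad(w_trained, retain)

w_unlearned = unlearn(w_trained, forget, retain)
forget_after, _ = mean_loss_and_grad(w_unlearned, forget)
retain_after, _ = mean_loss_and_grad(w_unlearned, retain)

print("forget loss:", forget_before, "->", forget_after)
print("retain loss:", retain_before, "->", retain_after)
```

In this toy setup the forget and retain examples use separate features, so the two objectives do not interfere. In a real LLM they share parameters, which is why unlearning methods generally have to trade forgetting against preserved utility.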
Papers
Guardrail Baselines for Unlearning in LLMs
Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks