Refusal Training
Refusal training in large language models (LLMs) aims to prevent the generation of harmful or inappropriate content while avoiding the over-blocking of safe queries. Current research focuses on making refusal mechanisms more accurate and nuanced, exploring techniques such as adversarial training, representation-space manipulation, and fine-grained, token-level content moderation in models such as GPT and LLaMA. This work is crucial for the safe and responsible deployment of LLMs: it addresses false refusals and vulnerabilities to adversarial attacks, and ultimately improves the usability and trustworthiness of these models.
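To make the representation-space manipulation idea concrete, below is a minimal sketch of extracting a "refusal direction" via difference-in-means and then ablating it, in the spirit of the first paper listed under Papers. It is illustrative only: the activation arrays are random placeholders standing in for residual-stream activations at some chosen layer and token position, and the layer/position choices, sample sizes, and variable names are assumptions, not the authors' implementation.

```python
# Illustrative sketch: difference-in-means "refusal direction" + directional ablation.
# Activations here are random placeholders; in practice they would be taken from a
# transformer's residual stream at a chosen layer and token position.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64                                      # hidden size (illustrative)
harmful_acts = rng.normal(size=(32, d_model))     # placeholder activations on harmful prompts
harmless_acts = rng.normal(size=(32, d_model))    # placeholder activations on harmless prompts

# 1. Estimate the refusal direction as the difference of mean activations.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# 2. Directional ablation: project out the refusal component from an activation
#    vector before it propagates to later layers.
def ablate(activation: np.ndarray, unit_dir: np.ndarray) -> np.ndarray:
    return activation - np.dot(activation, unit_dir) * unit_dir

x = rng.normal(size=d_model)
x_ablated = ablate(x, direction)
print(np.dot(x_ablated, direction))               # ~0: component along the direction removed
```

In an actual model, the same projection would be applied to hidden states during the forward pass (e.g., via hooks), which is how a single direction can modulate refusal behavior.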
Papers
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
Jie Li, Yi Liu, Chongyang Liu, Xiaoning Ren, Ling Shi, Weisong Sun, Yinxing Xue