Refusal Training

Refusal training in large language models (LLMs) aims to prevent the generation of harmful or inappropriate content while avoiding the over-refusal of benign queries. Current research focuses on making refusal mechanisms more accurate and nuanced, exploring techniques such as adversarial training, representation-space manipulation, and fine-grained, token-level content moderation in models such as GPT and LLaMA. This work addresses false refusals and vulnerabilities to adversarial attacks, and is central to deploying LLMs that are both safe and trustworthy in practice.
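
As a rough illustration of the representation-space manipulation mentioned above, the sketch below estimates a "refusal direction" as the difference of mean hidden activations on harmful versus harmless prompts and then projects that direction out of a model's hidden states. This is a minimal, hypothetical example under our own assumptions (pre-extracted activation tensors, made-up function names), not the exact procedure of any particular paper.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a unit-norm 'refusal direction' as the difference of mean
    hidden activations on harmful vs. harmless prompts.

    harmful_acts, harmless_acts: [num_prompts, hidden_dim] tensors of
    activations taken from some chosen layer (assumed precomputed).
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each hidden state along `direction`
    (projection ablation), leaving the rest of the representation intact.

    hidden: [..., hidden_dim]; direction: [hidden_dim], unit norm.
    """
    coeff = hidden @ direction                      # projection coefficients [...]
    return hidden - coeff.unsqueeze(-1) * direction # subtract the projected component

# Example usage with random stand-in activations (hypothetical shapes).
harmful = torch.randn(64, 4096)
harmless = torch.randn(64, 4096)
d = refusal_direction(harmful, harmless)
edited_hidden = ablate_direction(torch.randn(2, 16, 4096), d)
```

In practice such a direction would be estimated from real model activations and ablated (or, conversely, added) at inference time to study or steer refusal behavior; the sketch only shows the linear-algebra core of that idea.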

Papers