Toxicity Mitigation

Toxicity mitigation in large language models (LLMs) focuses on reducing the generation of harmful or offensive text, primarily through algorithmic interventions applied at inference time or via model fine-tuning. Current research explores methods such as modifying attention weights, adjusting decoding strategies (e.g., identifying a linear "toxicity" subspace in the model's representations and steering generation away from it), and leveraging preference tuning with empathetic data or retrieval-augmented generation. These advances are crucial for the safe and responsible deployment of LLMs across applications, improving their trustworthiness and reducing potential societal harms.
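
The linear-subspace idea can be illustrated with a minimal sketch: estimate a toxicity direction from the difference in mean hidden activations over toxic and non-toxic text, then project that direction out of the hidden state before it reaches the language-model head. The random tensors, the 768-dimensional hidden size, and the single estimated direction below are illustrative placeholders, not any specific paper's method; in practice the activations would be collected from a real model on labeled generations.

```python
# Minimal sketch of decoding-time steering away from a linear "toxicity"
# direction. Toy random tensors stand in for real model activations.
import torch

torch.manual_seed(0)

# Hypothetical activations: rows are hidden states (dim 768) collected from
# toxic and non-toxic continuations of a language model.
toxic_acts = torch.randn(64, 768) + 0.5      # stand-in for toxic examples
nontoxic_acts = torch.randn(64, 768) - 0.5   # stand-in for benign examples

# Estimate a 1-D toxicity subspace as the normalized difference of means.
direction = toxic_acts.mean(0) - nontoxic_acts.mean(0)
direction = direction / direction.norm()

def project_out(hidden: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` that lies along direction `d`."""
    return hidden - (hidden @ d).unsqueeze(-1) * d

# At inference time, the projection would be applied to each decoding step's
# hidden state before the LM head, e.g. via a forward hook on a chosen layer.
h = torch.randn(1, 768)            # a single decoding-step hidden state
h_steered = project_out(h, direction)
print(float(h @ direction), float(h_steered @ direction))  # second is ~0
```

A practical system would tune which layer to intervene on and how strongly to attenuate the direction, since fully removing it can degrade fluency.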

Papers