Toxicity Mitigation
Toxicity mitigation in large language models (LLMs) focuses on reducing the generation of harmful or offensive text, primarily through interventions applied at inference time or through model fine-tuning. Current research explores methods such as modifying attention weights, steering decoding away from toxic outputs (e.g., by identifying toxic directions in a linear subspace of the model's hidden representations), preference tuning with empathetic data, and retrieval-augmented generation. These advances are important for the safe and responsible deployment of LLMs across applications, improving their trustworthiness and reducing potential societal harms.
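To make the linear-subspace idea concrete, the sketch below shows one common way such inference-time steering can work: estimate a "toxicity direction" from contrasting sets of hidden representations, then project that direction out of the model's hidden states before they reach the LM head. This is a minimal NumPy illustration under assumed shapes and with hypothetical function names and random stand-in data, not the exact procedure of any particular paper.

```python
import numpy as np

def estimate_toxic_direction(toxic_reps: np.ndarray, benign_reps: np.ndarray) -> np.ndarray:
    """Estimate a single 'toxicity direction' as the normalized difference
    of mean hidden representations of toxic vs. benign text (assumption:
    one linear direction captures much of the toxic signal)."""
    direction = toxic_reps.mean(axis=0) - benign_reps.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project_out(hidden: np.ndarray, direction: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Remove (or dampen, for strength < 1) the component of each hidden
    state that lies along the estimated toxic direction."""
    component = hidden @ direction                      # shape: (num_tokens,)
    return hidden - strength * np.outer(component, direction)

# Toy usage: random vectors stand in for real model hidden states.
rng = np.random.default_rng(0)
toxic_reps = rng.normal(size=(100, 768)) + 0.5          # hypothetical toxic examples
benign_reps = rng.normal(size=(100, 768)) - 0.5         # hypothetical benign examples
direction = estimate_toxic_direction(toxic_reps, benign_reps)

hidden_states = rng.normal(size=(4, 768))               # e.g., last-layer states for 4 tokens
steered = project_out(hidden_states, direction)
print(np.allclose(steered @ direction, 0.0))            # True: toxic component removed
```

In practice, a method in this family would apply such a projection (or a learned variant of it) inside the forward pass at generation time, trading off toxicity reduction against fluency via the projection strength.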