Gated Toxicity Avoidance

Gated Toxicity Avoidance (GTA) focuses on mitigating harmful language generation in large language models (LLMs) while preserving desirable performance characteristics such as fluency and coherence. Current research emphasizes methods, including reinforcement learning and retrieval-augmented approaches, that reduce toxicity across multiple languages and diverse prompts without significantly compromising generation quality. This work is crucial for the safe and responsible deployment of LLMs, addressing ethical concerns and promoting the development of more beneficial AI systems.
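The "gated" idea can be illustrated with a minimal sketch: instead of always applying a detoxifying intervention (which can hurt fluency), a gate weight controls how strongly a detoxified next-token distribution is blended into the base model's distribution. Everything below is an illustrative assumption, not the method from any specific paper: the toy logits, the gate values, and the helper `gated_next_token_dist` are all hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def gated_next_token_dist(base_logits, detox_logits, gate):
    """Blend base and detoxified next-token distributions.

    gate ~ 0: keep the base LM's distribution (preserve fluency);
    gate ~ 1: fully apply the toxicity-avoidance distribution.
    In practice the gate might come from a toxicity classifier
    scoring the current context (assumed here, not shown).
    """
    p_base = softmax(base_logits)
    p_detox = softmax(detox_logits)
    return [gate * d + (1 - gate) * b for b, d in zip(p_base, p_detox)]

# Toy 4-token vocabulary where index 2 stands for a "toxic" token.
base = [2.0, 1.0, 3.0, 0.5]     # base LM favors the toxic token
detox = [2.0, 1.0, -4.0, 0.5]   # detoxified logits suppress it

safe_ctx = gated_next_token_dist(base, detox, gate=0.0)   # benign context
risky_ctx = gated_next_token_dist(base, detox, gate=1.0)  # risky context
print(risky_ctx[2] < safe_ctx[2])  # gating lowers the toxic-token probability
```

Because the gate leaves benign contexts untouched (gate near 0 returns the base distribution exactly), generation quality is preserved wherever no intervention is needed.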

Papers