Fine-Grained Detoxification

Fine-grained detoxification of large language models (LLMs) focuses on mitigating the generation of harmful or biased content, aiming for safer and more responsible deployment. Current research explores both training-based methods, such as adapting model parameters to align with human preferences, and decoding-based approaches that modify the generation process at inference time using techniques like subspace projection or controlled sampling. These efforts are crucial for addressing the ethical concerns surrounding LLMs and improving their suitability for high-stakes applications, such as fraud detection and online moderation, where a nuanced understanding of toxicity is paramount.
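
To make the decoding-based idea concrete, below is a minimal sketch of subspace projection applied at inference time: hidden states are projected onto the orthogonal complement of a "toxicity subspace" before the language-model head computes next-token logits. The `toxic_basis` matrix, all tensor shapes, and the use of random data are illustrative assumptions; in practice such a subspace would be estimated from the model's activations on toxic versus non-toxic text, and the details vary across papers.

```python
# Illustrative sketch (not any specific paper's method): remove the component
# of hidden states that lies in an assumed, pre-estimated toxicity subspace.
import torch

def project_out(hidden: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Project `hidden` away from the subspace spanned by the orthonormal
    rows of `basis` (shape [k, d]), leaving the orthogonal component."""
    coeffs = hidden @ basis.T          # [..., k]: coordinates in the subspace
    toxic_component = coeffs @ basis   # [..., d]: reconstruction in hidden space
    return hidden - toxic_component

# Toy usage with placeholder dimensions: d-dim hidden states, k-dim subspace.
d, k = 768, 4
hidden_states = torch.randn(1, 10, d)                # [batch, seq, d]
q, _ = torch.linalg.qr(torch.randn(d, k))            # orthonormal columns
toxic_basis = q.T                                    # [k, d], orthonormal rows

detoxified = project_out(hidden_states, toxic_basis)
# The edited states would then feed the LM head to produce detoxified logits.
```

Because the projection is a cheap linear operation applied per decoding step, approaches in this family leave the model's weights untouched, in contrast to the training-based methods above.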

Papers