Harmful Content

Harmful content generation and detection in large language models (LLMs) and text-to-image diffusion models is a rapidly evolving research area focused on mitigating the risks of bias, toxicity, and misinformation. Current research emphasizes developing methods to prevent harmful outputs through techniques like attention re-weighting, prompt engineering, and unlearning harmful knowledge, often employing multimodal approaches and continual learning frameworks. This work is crucial for ensuring the responsible development and deployment of AI systems, impacting both the safety of online environments and the ethical considerations surrounding AI development.

Papers

April 20, 2023

"HOT" ChatGPT: The promise of ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social media
Lingyao Li, Lizhou Fan, Shubham Atreja, Libby Hemphill
ChatGPT Generated Conversation Social Medium Harmful Content User Generated Content Promise Toxic Comment ChatGPT Related ChatGPT Revised Abstract

March 3, 2023

CONTAIN: A Community-based Algorithm for Network Immunization
Elena-Simona Apostol, Özgur Coban, Ciprian-Octavian Truică
Social Network Harmful Content Network Analysis Network Immunization

November 21, 2022

Measuring Harmful Representations in Scandinavian Language Models
Samia Touileb, Debora Nozza
Language Model Harmful Content Gender Stereotype Gender Inequality

October 6, 2022

Explainable Abuse Detection as Intent Classification and Slot Filling
Agostina Calabrese, Björn Ross, Mirella Lapata
Intent Classification Harmful Content Slot Filling Abuse Detection

October 5, 2022

Addressing contingency in algorithmic (mis)information classification: Toward a responsible machine learning agenda
Andrés Domínguez Hernández, Richard Owen, Dan Saattrup Nielsen, Ryan McConville
Machine Learning Task Scheduling Misinformation Claim Content Moderation Harmful Content Online Misinformation Contingency Game

August 22, 2022

Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case Study of Political Public Figures
Lanqin Yuan, Marian-Andrei Rizoiu
Case Study Multi Task Learning Hate Speech Unseen Domain Harmful Content United State Politician Abusive Content

July 13, 2022

O-Dang! The Ontology of Dangerous Speech Messages
Marco A. Stranisci, Simona Frenda, Mirko Lai, Oscar Araque, Alessandra T. Cignarella, Valerio Basile, Viviana Patti, Cristina Bosco
NLP Field Top Level Ontology NLP Community Harmful Content Linguistic Information

June 16, 2022

Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
Maribeth Rauh, John Mellor, Jonathan Uesato, Po-Sen Huang, Johannes Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, Iason Gabriel, William Isaac, Lisa Anne Hendricks
Large Language Model Language Model Benchmark Study Toxicity Detection Harmful Content Distinct AInality Trait

May 9, 2022

April 29, 2022

Handling and Presenting Harmful Text in NLP Research
Hannah Rose Kirk, Abeba Birhane, Bertie Vidgen, Leon Derczynski
Hate Speech NLP Community NLP Research Harmful Content Text Data

December 15, 2021

Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings
Andrew Wang, Mohit Sudhakar, Yangfeng Ji
Language Model Non Toxic Harmful Content Text Detoxification

Harmful Content

Papers

"HOT" ChatGPT: The promise of ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social media

CONTAIN: A Community-based Algorithm for Network Immunization

Measuring Harmful Representations in Scandinavian Language Models

Explainable Abuse Detection as Intent Classification and Slot Filling

Addressing contingency in algorithmic (mis)information classification: Toward a responsible machine learning agenda

Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case Study of Political Public Figures

O-Dang! The Ontology of Dangerous Speech Messages

Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models

TeamX@DravidianLangTech-ACL2022: A Comparative Analysis for Troll-Based Meme Classification

Detecting and Understanding Harmful Memes: A Survey

Handling and Presenting Harmful Text in NLP Research

Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings