Content Moderation Software
Content moderation software aims to automatically identify and filter harmful online content, such as hate speech and misinformation, across media types including text, images, and video. Current research focuses on mitigating biases against marginalized groups, improving detection of subtle or disguised toxic content (e.g., implicit hate speech or text embedded in images), and developing more robust defenses against "jailbreaking" of large language models. These advances are central to safer online environments and fairer algorithmic decision-making, shaping both the design of more equitable AI systems and the day-to-day management of online platforms.
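At its core, the text side of such a system is a classifier applied behind a decision threshold. The sketch below illustrates that pattern; it is a minimal example assuming the Hugging Face transformers library and a publicly available toxicity model. The model name ("unitary/toxic-bert"), the 0.5 threshold, and the "toxic" label are illustrative assumptions, not details taken from any specific paper.

# Minimal sketch of a threshold-based text moderation filter.
# Assumes the Hugging Face "transformers" library; the model name and
# its output labels are illustrative and vary by model.
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

def moderate(texts, threshold=0.5):
    """Return the subset of texts whose predicted toxicity exceeds the threshold."""
    flagged = []
    for text, result in zip(texts, classifier(texts)):
        # Each result is a dict like {"label": "toxic", "score": 0.98};
        # the label set depends on the chosen model.
        if result["label"].lower() == "toxic" and result["score"] >= threshold:
            flagged.append(text)
    return flagged

if __name__ == "__main__":
    print(moderate(["Have a great day!", "I hate all of you."]))

The threshold is the main operational knob: raising it reduces false positives at the cost of letting more borderline content through, which is the same coverage-versus-fairness trade-off that the bias-mitigation research described above must navigate.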