Red Teaming

Red teaming, in the context of artificial intelligence, involves adversarial testing of AI models, particularly large language models (LLMs) and increasingly multimodal models, to identify vulnerabilities and biases. Current research focuses on automating this process using techniques like reinforcement learning, generative adversarial networks, and novel scoring functions to create diverse and effective adversarial prompts or inputs that expose model weaknesses. This rigorous evaluation is crucial for improving the safety, robustness, and ethical implications of AI systems, informing both model development and deployment strategies across various applications.

Papers

April 6, 2024

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li
Language Model Human SAFETY Comprehensive Benchmark Red Teaming Adversarial Testing Alert System Adversarial Misuse

April 2, 2024

Red-Teaming Segment Anything Model
Krzysztof Jankowski, Bartlomiej Sobieski, Mateusz Kwiatkowski, Jakub Szulc, Michal Janik, Hubert Baniecki, Przemyslaw Biecek
Adversarial Attack Segment Anything Model Segmentation Task Segmentation Mask Red Teaming

March 31, 2024

Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin, Xudong Han, Haonan Li
Language Model Timely Survey Generative Model Red Teaming Multimodal Attack

March 12, 2024

Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI
Vladimir Zaigrajew, Hubert Baniecki, Lukasz Tulczyjew, Agata M. Wijata, Jakub Nalepa, Nicolas Longépé, Przemyslaw Biecek
Explainable AI Hyperspectral Image Red Teaming Hyperspectral Imaging System Geospatial Machine Learning

March 7, 2024

A Safe Harbor for AI Evaluation and Red Teaming
Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, Peter Henderson
Generative AI Red Teaming Generative AI System Comprehensive Trustworthiness AI Evaluation Traffic Safety Safe Deep

February 29, 2024

Curiosity-driven Red-teaming for Large Language Models
Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal
Large Language Model Natural Language Red Teaming Harmful Content Curiosity Driven Exploration

February 21, 2024

AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning
Vasudev Gohil, Satwik Patnaik, Dileep Kalathil, Jeyavijayan Rajendran
Red Teaming Obfuscation Technique Attack Graph Hardware Trojan Biological Circuit Hardware Security

February 14, 2024

Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
Jessica Quaye, Alicia Parrish, Oana Inel, Charvi Rastogi, Hannah Rose Kirk, Minsuk Kahng, Erin van Liemt, Max Bartolo, Jess Tsang, Justin White, Nathan Clement, Rafael Mosquera, Juan Ciro, Vijay Janapa Reddi, Lora Aroyo
Text to Image Generation Generative AI Model Adversarial Prompt Red Teaming Potential Harm Unsafe Image Imperceptible Attack Adversarial Challenge

February 6, 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks
Red Teaming LLM Robustness Refusal Response

January 30, 2024

Gradient-Based Language Model Red Teaming
Nevan Wichers, Carson Denison, Ahmad Beirami
Prompt Learning Generative Language Model Adversarial Prompt Red Teaming

January 29, 2024

January 23, 2024

Red Teaming Visual Language Models
Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu
Vision Language Model Multimodal Input Red Teaming

January 19, 2024

Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
Language Model Global Impact Red Teaming Model Accuracy Wind Pressure Model Reliability Vortex Pattern

December 30, 2023

Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks
Aleksander Buszydlik, Karol Dobiczek, Michał Teodor Okoń, Konrad Skublicki, Philip Lippmann, Jie Yang
Visual Analogue Scale Reasoning Task Red Teaming Math Question Structured Reasoning Math Task

December 11, 2023

Control Risk for Potential Misuse of Artificial Intelligence in Science
Jiyan He, Weitao Feng, Yaosen Min, Jingwei Yi, Kunsheng Tang, Shuai Li, Jie Zhang, Kejiang Chen, Wenbo Zhou, Xing Xie, Weiming Zhang, Nenghai Yu, Shuxin Zheng
Artificial Intelligence Responsible AI Science Journalism Red Teaming Risk Management

December 8, 2023

A Red Teaming Framework for Securing AI in Maritime Autonomous Systems
Mathew J. Walter, Aaron Barrett, Kimberly Tam
Red Teaming Artificial Intelligence Security Maritime Autonomous System

November 14, 2023

AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
Bhaktipriya Radharapu, Kevin Robinson, Lora Aroyo, Preethi Lahoti
Red Teaming Adversarial Testing Adversarial Datasets LLM Powered Application

November 13, 2023

MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
Adversarial Prompt Red Teaming LLM Safety LLM Attack

November 10, 2023

Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming
Nanna Inie, Jonathan Stray, Leon Derczynski
Large Language Model Wild Challenge New Attack General Strategy Red Teaming Trading Devil Grounded Theory Effective Generation Qualitative Methodology