Paper ID: 2405.18780
Quantitative Certification of Bias in Large Language Models
Isha Chaudhary, Qian Hu, Manoj Kumar, Morteza Ziyadi, Rahul Gupta, Gagandeep Singh
Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate LLM bias, as they can not scale to large number of inputs and provide no guarantees. Therefore, we propose the first framework, QuaCer-B that certifies LLMs for bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of prompts mentioning various demographic groups, sampled from a distribution. We illustrate the bias certification for distributions of prompts created by applying varying prefixes drawn from a prefix distributions, to a given set of prompts. We consider prefix distributions for random token sequences, mixtures of manual jailbreaks, and jailbreaks in the LLM's embedding space to certify bias. We obtain non-trivial certified bounds on the probability of unbiased responses of SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive distributions of prefixes.
Submitted: May 29, 2024