Calibration Attack

Calibration attacks are adversarial attacks that manipulate a machine learning model's confidence scores without changing its predictions, thereby undermining the trustworthiness of its outputs. Current research focuses both on launching these attacks (e.g., by inducing systematic under- or over-confidence in a model's predictions) and on defending against them, often using adversarial training and measuring the impact with metrics such as expected calibration error (ECE). This work is crucial for the reliability of machine learning systems in high-stakes applications, where accurate confidence estimates are essential for safe and effective decision-making.
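
To make the idea concrete, below is a minimal, illustrative sketch in PyTorch, not any specific paper's method: it shows (a) how the binned expected calibration error is computed from confidences, predictions, and labels, and (b) a PGD-style "underconfidence" attack step that perturbs an input within an L-infinity ball to lower the model's top-class confidence while keeping the predicted label unchanged. Names such as `model`, `epsilon`, `step`, and `n_bins`, and the assumption of image-shaped (N, C, H, W) inputs, are illustrative choices, not details taken from the literature surveyed here.

```python
import torch
import torch.nn.functional as F


def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """Binned ECE: weighted average of |accuracy - confidence| over confidence bins."""
    bins = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = (predictions[in_bin] == labels[in_bin]).float().mean()
            conf = confidences[in_bin].mean()
            ece += in_bin.float().mean() * (acc - conf).abs()
    return ece.item()


def underconfidence_attack(model, x, epsilon=0.03, step=0.005, n_steps=20):
    """Perturb x within an L_inf ball of radius epsilon to minimize the
    confidence of the originally predicted class, rejecting any step that
    would flip the prediction (so only the confidence score is attacked)."""
    model.eval()
    with torch.no_grad():
        y_pred = model(x).argmax(dim=1)          # labels the attack must preserve
    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        probs = F.softmax(model(x_adv), dim=1)
        conf = probs.gather(1, y_pred.unsqueeze(1)).squeeze(1)
        loss = conf.sum()                        # drive top-class confidence down
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            candidate = x_adv - step * grad.sign()
            candidate = x + (candidate - x).clamp(-epsilon, epsilon)  # project to ball
            keep = model(candidate).argmax(dim=1) == y_pred
            # accept the step only for samples whose prediction is unchanged
            x_adv = torch.where(keep.view(-1, 1, 1, 1), candidate, x_adv).detach()
    return x_adv
```

An overconfidence variant of the same sketch would ascend rather than descend the confidence gradient; in either case, success is typically measured by how much the attack inflates ECE (or shifts average confidence) while leaving accuracy untouched.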

Papers