Paper ID: 2209.02965

Risk of Bias in Chest Radiography Deep Learning Foundation Models

Ben Glocker, Charles Jones, Melanie Roschewitz, Stefan Winzeck

Purpose: To analyze a recently published chest radiography foundation model for the presence of biases that could lead to subgroup performance disparities across biological sex and race. Materials and Methods: This retrospective study used 127,118 chest radiographs from 42,884 patients (mean age, 63 [SD] 17 years; 23,623 male, 19,261 female) from the CheXpert dataset collected between October 2002 and July 2017. To determine the presence of bias in features generated by a chest radiography foundation model and baseline deep learning model, dimensionality reduction methods together with two-sample Kolmogorov-Smirnov tests were used to detect distribution shifts across sex and race. A comprehensive disease detection performance analysis was then performed to associate any biases in the features to specific disparities in classification performance across patient subgroups. Results: Ten out of twelve pairwise comparisons across biological sex and race showed statistically significant differences in the studied foundation model, compared with four significant tests in the baseline model. Significant differences were found between male and female (P < .001) and Asian and Black patients (P < .001) in the feature projections that primarily capture disease. Compared with average model performance across all subgroups, classification performance on the 'no finding' label dropped between 6.8% and 7.8% for female patients, and performance in detecting 'pleural effusion' dropped between 10.7% and 11.6% for Black patients. Conclusion: The studied chest radiography foundation model demonstrated racial and sex-related bias leading to disparate performance across patient subgroups and may be unsafe for clinical applications.

Submitted: Sep 7, 2022