Paper ID: 2309.04318
Generating the Ground Truth: Synthetic Data for Soft Label and Label Noise Research
Sjoerd de Vries, Dirk Thierens
In many real-world classification tasks, label noise is an unavoidable issue that adversely affects the generalization error of machine learning models. Additionally, evaluating how methods handle such noise is complicated, as the effect label noise has on their performance cannot be accurately quantified without clean labels. Existing research on label noise typically relies on either noisy or oversimplified simulated data as a baseline, into which additional noise with known properties is injected. In this paper, we introduce SYNLABEL, a framework designed to address these limitations by creating noiseless datasets informed by real-world data. SYNLABEL supports defining a pre-specified or learned function as the ground truth function, which can then be used for generating new clean labels. Furthermore, by repeatedly resampling values for selected features within the domain of the function, evaluating the function, and aggregating the resulting labels, each data point can be assigned a soft label or label distribution. These distributions capture the inherent uncertainty present in many real-world datasets and enable the direct injection and quantification of label noise. The generated datasets serve as a clean baseline of adjustable complexity, into which various types of noise can be introduced. Moreover, they facilitate research into soft label learning and related applications. We demonstrate the application of SYNLABEL, showcasing its ability to precisely quantify label noise and its improvement over existing methodologies.
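The resampling procedure the abstract describes (perturb selected features, re-evaluate the ground truth function, aggregate the labels into a distribution) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the SYNLABEL API: the function and variable names are hypothetical, and Gaussian perturbation of the selected features is just one possible choice of resampling distribution.

```python
import numpy as np

def soft_labels(X, ground_truth, uncertain_idx, noise_std=0.5,
                n_samples=500, n_classes=2, seed=None):
    """Approximate a per-instance label distribution by repeatedly
    resampling the selected 'uncertain' features and aggregating the
    labels produced by a deterministic ground-truth function.
    (Sketch only; the noise model and signature are assumptions.)"""
    rng = np.random.default_rng(seed)
    counts = np.zeros((len(X), n_classes))
    for _ in range(n_samples):
        Xs = X.copy()
        # perturb only the features designated as uncertain
        Xs[:, uncertain_idx] += rng.normal(
            0.0, noise_std, size=(len(X), len(uncertain_idx)))
        labels = ground_truth(Xs)  # clean labels from the ground truth
        counts[np.arange(len(X)), labels] += 1
    return counts / n_samples      # rows sum to 1: soft labels

# toy ground-truth function: class 1 iff the feature sum is positive
gt = lambda X: (X.sum(axis=1) > 0).astype(int)

X = np.array([[2.0, 1.0],      # far from the decision boundary
              [0.05, -0.02],   # borderline point
              [-3.0, -1.0]])   # far from the boundary, other side
P = soft_labels(X, gt, uncertain_idx=[0, 1], seed=0)
# points far from the boundary receive near-degenerate distributions,
# while the borderline point receives a genuinely soft label
```

Because the ground-truth function is known exactly, the divergence between these soft labels and any hard labels injected afterwards gives a direct, quantifiable measure of the introduced label noise, which is the evaluation the abstract refers to.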
Submitted: Sep 8, 2023