Corpus Bias

Corpus bias, the systematic skew in the data used to train machine learning models, degrades both the performance and the fairness of these systems. Current research focuses on identifying and mitigating such bias across modalities, including speech, text, and images, often using techniques such as counterfactual learning and self-supervised methods within architectures like transformers. Addressing corpus bias is essential for building reliable, fair AI systems in fields ranging from natural language processing and speech recognition to computer vision, and it supports ethical considerations in AI development more broadly. The creation of specialized datasets, such as those targeting gender bias in Chinese text, reflects ongoing efforts to improve data quality and reduce bias.
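As a minimal illustration of what "identifying" corpus bias can mean in practice, the sketch below computes a simple gendered-pronoun frequency skew over a toy English corpus. The term lists, function name, and corpus are hypothetical examples for this page, not taken from any particular paper; real bias audits use far richer lexicons and statistical tests.

```python
import re
from collections import Counter

# Illustrative term lists -- a real audit would use a curated lexicon.
MALE_TERMS = {"he", "him", "his"}
FEMALE_TERMS = {"she", "her", "hers"}

def pronoun_skew(corpus):
    """Return (male_count, female_count, skew), where skew is the male
    share of all gendered-pronoun occurrences (0.5 means balanced)."""
    tokens = re.findall(r"[a-z']+", " ".join(corpus).lower())
    counts = Counter(tokens)
    male = sum(counts[t] for t in MALE_TERMS)
    female = sum(counts[t] for t in FEMALE_TERMS)
    total = male + female
    skew = male / total if total else 0.5
    return male, female, skew

corpus = [
    "He said his model was ready.",
    "She reviewed the results before he merged them.",
]
print(pronoun_skew(corpus))  # → (3, 1, 0.75): male pronouns dominate
```

A skew far from 0.5 on a large corpus is one cheap signal that a dataset over-represents one group, which can then motivate mitigation steps such as counterfactual data augmentation.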

Papers