Data Imbalance

Data imbalance occurs when some classes in a dataset are significantly under-represented compared to others. It poses a major challenge for machine learning models, which tend to produce biased predictions and perform poorly on minority classes. Current research mitigates imbalance through data augmentation (e.g., synthetic oversampling using LLMs), algorithmic modifications (e.g., cost-sensitive learning and novel loss functions such as LDAM and IWL), and ensemble methods, often applied to models such as XGBoost, graph neural networks, and other deep neural networks. Addressing data imbalance is crucial for improving the fairness, reliability, and generalizability of machine learning models across diverse applications, from medical diagnosis and fraud detection to environmental monitoring and materials science.
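
As a rough illustration of cost-sensitive learning, the sketch below reweights a cross-entropy loss by inverse class frequency, so that errors on the minority class cost more. It is a minimal example under stated assumptions, not a method taken from any particular paper: the function names `balanced_class_weights` and `weighted_cross_entropy` and the toy data are hypothetical, and the weighting rule is the common "balanced" heuristic `n_samples / (n_classes * class_count)`.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the total weight."""
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum


# Toy data (hypothetical): 9 majority-class samples and 1 minority-class sample.
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)

# A classifier that always favors the majority class.
probs = [[0.9, 0.1] for _ in labels]

uniform = {0: 1.0, 1: 1.0}
plain_loss = weighted_cross_entropy(probs, labels, uniform)
balanced_loss = weighted_cross_entropy(probs, labels, weights)
print(weights)
print(plain_loss, balanced_loss)
```

With uniform weights, the majority-favoring classifier gets a low loss despite misclassifying the only minority sample. With balanced weights, that single mistake dominates the loss, which is the pressure cost-sensitive training relies on.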

Papers