Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality, using data more efficiently, and mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and countering problems such as data contamination and imbalance through techniques such as data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness in applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
Papers
Understanding new tasks through the lens of training data via exponential tilting
Subha Maity, Mikhail Yurochkin, Moulinath Banerjee, Yuekai Sun
Training and Inference on Any-Order Autoregressive Models the Right Way
Andy Shih, Dorsa Sadigh, Stefano Ermon
On the Inconsistency of Kernel Ridgeless Regression in Fixed Dimensions
Daniel Beaglehole, Mikhail Belkin, Parthe Pandit
The Document Vectors Using Cosine Similarity Revisited
Zhang Bingyu, Nikolay Arefyev