Training Data
Training data is central to machine learning model development. Current research focuses on improving data quality, using data more efficiently, and mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and mitigating problems such as data contamination and class imbalance through techniques such as data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data significantly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
Papers
Counterfactual Explanations for Multivariate Time-Series without Training Datasets
Xiangyu Sun, Raquel Aoki, Kevin H. Wilson
PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild
Kun Yuan, Hongbo Liu, Mading Li, Muyi Sun, Ming Sun, Jiachao Gong, Jinhua Hao, Chao Zhou, Yansong Tang
Data Augmentation Method Utilizing Template Sentences for Variable Definition Extraction
Kotaro Nagayama, Shota Kato, Manabu Kano
Cascade of phase transitions in the training of Energy-based models
Dimitrios Bachtis, Giulio Biroli, Aurélien Decelle, Beatriz Seoane
RaFe: Ranking Feedback Improves Query Rewriting for RAG
Shengyu Mao, Yong Jiang, Boli Chen, Xiao Li, Peng Wang, Xinyu Wang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Tell me why: Training preferences-based RL with human preferences and step-level explanations
Jakob Karalus
Understanding the Training and Generalization of Pretrained Transformer for Sequential Decision Making
Hanzhao Wang, Yu Pan, Fupeng Sun, Shang Liu, Kalyan Talluri, Guanting Chen, Xiaocheng Li
G-DIG: Towards Gradient-based Diverse and High-quality Instruction Data Selection for Machine Translation
Xingyuan Pan, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Shanbo Cheng
EntropyStop: Unsupervised Deep Outlier Detection with Loss Entropy
Yihong Huang, Yuang Zhang, Liping Wang, Fan Zhang, Xuemin Lin
Data-Efficient Low-Complexity Acoustic Scene Classification in the DCASE 2024 Challenge
Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Khaled Koutini, Gerhard Widmer
Automating the Training and Deployment of Models in MLOps by Integrating Systems with Machine Learning
Penghao Liang, Bo Song, Xiaoan Zhan, Zhou Chen, Jiaqiang Yuan