Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning); and addressing issues such as data contamination and class imbalance through techniques like data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
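To make one of the data-selection strategies above concrete, the sketch below shows a minimal form of active learning: uncertainty sampling, where the examples with the highest predictive entropy are chosen for labeling. The function name and the toy probability pool are illustrative assumptions, not drawn from any of the listed papers.

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Select the k most uncertain examples by predictive entropy.

    probs: (n_samples, n_classes) array of model class probabilities.
    Returns indices of the k examples with the highest entropy.
    """
    eps = 1e-12  # avoid log(0) for confident predictions
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # Sort by entropy descending and keep the top k indices.
    return np.argsort(entropy)[::-1][:k]

# Toy pool of 4 unlabeled examples: the first is confidently
# classified, the last is nearly maximally uncertain.
pool_probs = np.array([
    [0.98, 0.01, 0.01],
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.34, 0.33, 0.33],
])
picked = uncertainty_sample(pool_probs, k=2)
print(picked)  # indices of the two most uncertain rows: [3 2]
```

In a real loop, the selected examples would be sent to an annotator, added to the labeled set, and the model retrained before the next selection round.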
Papers
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu
Preventing Arbitrarily High Confidence on Far-Away Data in Point-Estimated Discriminative Neural Networks
Ahmad Rashid, Serena Hacker, Guojun Zhang, Agustinus Kristiadi, Pascal Poupart
Exploring Practitioner Perspectives On Training Data Attribution Explanations
Elisa Nguyen, Evgenii Kortukov, Jean Y. Song, Seong Joon Oh
From Denoising Training to Test-Time Adaptation: Enhancing Domain Generalization for Medical Image Segmentation
Ruxue Wen, Hangjie Yuan, Dong Ni, Wenbo Xiao, Yaoyao Wu
FPGAN-Control: A Controllable Fingerprint Generator for Training with Synthetic Data
Alon Shoshan, Nadav Bhonker, Emanuel Ben Baruch, Ori Nizan, Igor Kviatkovsky, Joshua Engelsma, Manoj Aggarwal, Gerard Medioni
TRIAGE: Characterizing and auditing training data for improved regression
Nabeel Seedat, Jonathan Crabbé, Zhaozhi Qian, Mihaela van der Schaar
Debiasing Algorithm through Model Adaptation
Tomasz Limisiewicz, David Mareček, Tomáš Musil
Critic-Driven Decoding for Mitigating Hallucinations in Data-to-text Generation
Mateusz Lango, Ondřej Dušek
RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
Bowen Tan, Yun Zhu, Lijuan Liu, Hongyi Wang, Yonghao Zhuang, Jindong Chen, Eric Xing, Zhiting Hu
Adversarial sample generation and training using geometric masks for accurate and resilient license plate character recognition
Bishal Shrestha, Griwan Khakurel, Kritika Simkhada, Badri Adhikari
Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models
Jaeyoung Choe, Keonwoong Noh, Nayeon Kim, Seyun Ahn, Woohwan Jung
Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks
Ruixiang Tang, Gord Lueck, Rodolfo Quispe, Huseyin A Inan, Janardhan Kulkarni, Xia Hu