Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality, using data more efficiently, and mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning); and addressing problems such as data contamination and imbalance through techniques such as data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness, in applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
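To make the idea of data selection concrete, here is a minimal sketch of one common active learning strategy, entropy-based uncertainty sampling: the examples the model is least sure about are sent for labeling first. It is an illustration only and is not taken from any of the papers listed below; the function names and the predicted probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(pool_probs, k):
    """Return the indices of the k unlabeled examples with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical class probabilities predicted by a binary classifier
# for four unlabeled examples (made-up values for illustration).
pool_probs = [
    [0.98, 0.02],  # confident prediction, low entropy
    [0.55, 0.45],  # uncertain
    [0.80, 0.20],  # moderately confident
    [0.50, 0.50],  # maximally uncertain
]

chosen = select_for_labeling(pool_probs, k=2)
print(chosen)  # indices of the two most uncertain examples
```

In a full loop, the selected examples would be labeled, added to the training set, and the model retrained before the next round of selection. Other acquisition criteria (for example margin-based or committee-based ones) could be substituted for entropy.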
Papers
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal
Efficient human-in-loop deep learning model training with iterative refinement and statistical result validation
Manuel Zahn, Douglas P. Perrin
Poster: Link between Bias, Node Sensitivity and Long-Tail Distribution in trained DNNs
Mahum Naseer, Muhammad Shafique
A semi-automatic method for document classification in the shipping industry
Narayanan Arvind
RusTitW: Russian Language Text Dataset for Visual Text in-the-Wild Recognition
Igor Markov, Sergey Nesteruk, Andrey Kuznetsov, Denis Dimitrov
On the Query Complexity of Training Data Reconstruction in Private Learning
Prateeti Mukherjee, Satya Lokam
Improving Code Generation by Training with Natural Language Feedback
Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, Ethan Perez
Automated wildlife image classification: An active learning tool for ecological applications
Ludwig Bothmann, Lisa Wimmer, Omid Charrakh, Tobias Weber, Hendrik Edelhoff, Wibke Peters, Hien Nguyen, Caryl Benjamin, Annette Menzel
From Single-Hospital to Multi-Centre Applications: Enhancing the Generalisability of Deep Learning Models for Adverse Event Prediction in the ICU
Patrick Rockenschaub, Adam Hilbert, Tabea Kossen, Falk von Dincklage, Vince Istvan Madai, Dietmar Frey
Zero-Shot Composed Image Retrieval with Textual Inversion
Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo
Automatic Generation of Labeled Data for Video-Based Human Pose Analysis via NLP applied to YouTube Subtitles
Sebastian Dill, Susi Zhihan, Maurice Rohr, Maziar Sharbafi, Christoph Hoog Antink
Enriching Neural Network Training Dataset to Improve Worst-Case Performance Guarantees
Rahul Nellikkath, Spyros Chatzivasileiadis