Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and addressing issues such as data contamination and class imbalance through techniques like data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness, influencing applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
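To make the data-selection idea concrete, uncertainty-based active learning can be sketched in a few lines: train a classifier on a small labeled seed set, rank the unlabeled pool by predictive entropy, and query the most uncertain examples for labels. The snippet below is a minimal illustrative sketch using scikit-learn on synthetic data; the pool sizes, query budget, and variable names are assumptions and are not taken from any of the papers listed here.

```python
# Minimal sketch of pool-based active learning via uncertainty sampling.
# Pool sizes, budgets, and names are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: a small labeled seed set plus a large unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = rng.choice(len(X), size=50, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

model = LogisticRegression(max_iter=1000)

for round_ in range(5):  # 5 acquisition rounds
    model.fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[unlabeled])
    # Predictive entropy: higher means the model is less certain.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    pick = unlabeled[np.argsort(entropy)[-20:]]   # query 20 most uncertain points
    labeled = np.concatenate([labeled, pick])     # "oracle" reveals their labels
    unlabeled = np.setdiff1d(unlabeled, pick)
    print(f"round {round_}: labeled={len(labeled)}, acc={model.score(X, y):.3f}")
```

The same loop structure accommodates other acquisition scores (margin, ensemble disagreement) by swapping out the entropy computation.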
Papers
Aligning Teacher with Student Preferences for Tailored Training Data Generation
Yantao Liu, Zhao Zhang, Zijun Yao, Shulin Cao, Lei Hou, Juanzi Li
Time Matters: Scaling Laws for Any Budget
Itay Inbar, Luke Sernau
Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang
Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels
Nicholas Pangakis, Samuel Wolken
Distributed Training of Large Graph Neural Networks with Variable Communication Rates
Juan Cervino, Md Asadullah Turja, Hesham Mostafa, Nageen Himayat, Alejandro Ribeiro
DataFreeShield: Defending Adversarial Attacks without Training Data
Hyeyoon Lee, Kanghyun Choi, Dain Kwon, Sunjong Park, Mayoore Selvarasa Jaiswal, Noseong Park, Jonghyun Choi, Jinho Lee
DEM: Distribution Edited Model for Training with Mixed Data Distributions
Dhananjay Ram, Aditya Rawal, Momchil Hardalov, Nikolaos Pappas, Sheng Zha
Towards Exact Gradient-based Training on Analog In-memory Computing
Zhaoxian Wu, Tayfun Gokmen, Malte J. Rasch, Tianyi Chen
Extracting Training Data from Unconditional Diffusion Models
Yunhao Chen, Xingjun Ma, Difan Zou, Yu-Gang Jiang
Enhancing Spatio-temporal Quantile Forecasting with Curriculum Learning: Lessons Learned
Du Yin, Jinliang Deng, Shuang Ao, Zechen Li, Hao Xue, Arian Prabowo, Renhe Jiang, Xuan Song, Flora Salim
The Heterophilic Snowflake Hypothesis: Training and Empowering GNNs for Heterophilic Graphs
Kun Wang, Guibin Zhang, Xinnan Zhang, Junfeng Fang, Xun Wu, Guohao Li, Shirui Pan, Wei Huang, Yuxuan Liang
Soft Prompting for Unlearning in Large Language Models
Karuna Bhaila, Minh-Hao Van, Xintao Wu
Measuring memorization in RLHF for code completion
Aneesh Pappu, Billy Porter, Ilia Shumailov, Jamie Hayes
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models
Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye, Shiyang Feng, Bin Wang, Chao Xu, Conghui He, Pinlong Cai, Min Dou, Botian Shi, Sheng Zhou, Yongwei Wang, Bin Wang, Junchi Yan, Fei Wu, Yu Qiao
FullCert: Deterministic End-to-End Certification for Training and Inference of Neural Networks
Tobias Lorenz, Marta Kwiatkowska, Mario Fritz
Large Language Model Tokenizer Bias: A Case Study and Solution on GPT-4o
Jin Yang, Zhiqiang Wang, Yanbin Lin, Zunduo Zhao
The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences
Bria Long, Violet Xiang, Stefan Stojanov, Robert Z. Sparks, Zi Yin, Grace E. Keene, Alvin W. M. Tan, Steven Y. Feng, Chengxu Zhuang, Virginia A. Marchman, Daniel L. K. Yamins, Michael C. Frank
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs
Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele, Tom Goldstein