Training Data
Training data is crucial to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning and active learning), and addressing issues such as data contamination and class imbalance through techniques like data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness, with consequences for applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
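To make the data-selection theme concrete, here is a minimal sketch of uncertainty-based active learning, one common way to choose which unlabeled examples to annotate next. It is a generic illustration, not drawn from any of the papers listed below; the synthetic pool, the logistic-regression learner, and the acquisition budget are all assumptions made for the example.

```python
# Minimal active-learning sketch: uncertainty sampling on a synthetic pool.
# Hypothetical setup for illustration only; not taken from any listed paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unlabeled pool with hidden labels standing in for an annotation oracle.
X_pool = rng.normal(size=(1000, 5))
y_pool = (X_pool[:, 0] + 0.5 * X_pool[:, 1] > 0).astype(int)

# Start from a small random labeled seed set.
labeled = list(rng.choice(len(X_pool), size=20, replace=False))
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

model = LogisticRegression()
for _ in range(10):  # ten acquisition rounds
    model.fit(X_pool[labeled], y_pool[labeled])
    # Uncertainty sampling: query the points whose predicted probability
    # is closest to 0.5, i.e., where the current model is least confident.
    probs = model.predict_proba(X_pool[unlabeled])[:, 1]
    uncertainty = -np.abs(probs - 0.5)
    picks = np.argsort(uncertainty)[-10:]  # 10 most uncertain pool points
    newly_labeled = [unlabeled[i] for i in picks]
    labeled.extend(newly_labeled)
    unlabeled = [i for i in unlabeled if i not in newly_labeled]

print(f"labeled set size after acquisition: {len(labeled)}")
```

Other selection criteria (e.g., diversity- or loss-based scores) slot into the same loop by replacing the uncertainty score used to rank the pool.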
Papers
Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation
Lisa Dunlap, Alyssa Umino, Han Zhang, Jiezhi Yang, Joseph E. Gonzalez, Trevor Darrell
Scaling Data-Constrained Language Models
Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel
Training Data Extraction From Pre-trained Language Models: A Survey
Shotaro Ishihara
Zero-shot Generation of Training Data with Denoising Diffusion Probabilistic Model for Handwritten Chinese Character Recognition
Dongnan Gui, Kai Chen, Haisong Ding, Qiang Huo
Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and Training with Non-Linear Diffusion
Dai Shi, Zhiqi Shao, Yi Guo, Qibin Zhao, Junbin Gao
Training on Thin Air: Improve Image Classification with Generated Data
Yongchao Zhou, Hshmat Sahak, Jimmy Ba
Training Energy-Based Normalizing Flow with Score-Matching Objectives
Chen-Hao Chao, Wei-Fang Sun, Yen-Chang Hsu, Zsolt Kira, Chun-Yi Lee
Promoting Generalization in Cross-Dataset Remote Photoplethysmography
Nathan Vance, Jeremy Speth, Benjamin Sporrer, Patrick Flynn
Injecting Knowledge into Biomedical Pre-trained Models via Polymorphism and Synonymous Substitution
Hongbo Zhang, Xiang Wan, Benyou Wang
Extracting Psychological Indicators Using Question Answering
Luka Pavlović
Exploiting Correlations Between Contexts and Definitions with Multiple Definition Modeling
Linhan Zhang, Qian Chen, Wen Wang, Yuxin Jiang, Bing Li, Wei Wang, Xin Cao
FaceFusion: Exploiting Full Spectrum of Multiple Datasets
Chiyoung Song, Dongjae Lee
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito
Evaluation of medium-large Language Models at zero-shot closed book generative question answering
René Peinl, Johannes Wirth
Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning
Mustafa Safa Ozdayi, Charith Peris, Jack FitzGerald, Christophe Dupuy, Jimit Majmudar, Haidar Khan, Rahil Parikh, Rahul Gupta
Contextualized Word Vector-based Methods for Discovering Semantic Differences with No Training nor Word Alignment
Ryo Nagata, Hiroya Takamura, Naoki Otani, Yoshifumi Kawasaki
LIMA: Less Is More for Alignment
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy
Segment Any Anomaly without Training via Hybrid Prompt Regularization
Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Zongwei Du, Liang Gao, Weiming Shen