Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality, using data more efficiently, and mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and addressing data contamination and imbalance through techniques such as data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness in applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
Papers
Discursive objection strategies in online comments: Developing a classification schema and validating its training
Ashley L. Shea, Aspen K. B. Omapang, Ji Yong Cho, Miryam Y. Ginsparg, Natalie Bazarova, Winice Hui, René F. Kizilcec, Chau Tong, Drew Margolin
OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
Mihai Masala, Denis C. Ilie-Ablachim, Dragos Corlatescu, Miruna Zavelca, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea
Strategic Data Ordering: Enhancing Large Language Model Performance through Curriculum Learning
Jisu Kim, Juhwan Lee
Theoretical Guarantees of Data Augmented Last Layer Retraining Methods
Monica Welfert, Nathan Stromberg, Lalitha Sankar
Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models
Zhe Ma, Xuhong Zhang, Qingming Li, Tianyu Du, Wenzhi Chen, Zonghui Wang, Shouling Ji
On Training a Neural Network to Explain Binaries
Alexander Interrante-Grant, Andy Davis, Heather Preslier, Tim Leek
Safe Training with Sensitive In-domain Data: Leveraging Data Fragmentation To Mitigate Linkage Attacks
Mariia Ignashina, Julia Ive
Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang