High-Quality Data
High-quality data is crucial for training effective machine learning models, particularly large language models (LLMs) and multimodal models. Current research focuses on methods for creating, cleaning, and selecting high-quality datasets, including gamified crowdsourcing, counterfactual explanations for data augmentation, and filtering algorithms (e.g., ensembles of KenLM language models) that remove noise and bias. These efforts aim to improve model performance, robustness, and trustworthiness across applications ranging from autonomous driving to medical diagnosis, while addressing challenges posed by imbalanced datasets and the high cost of data annotation.
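To make the filtering idea concrete, here is a minimal sketch of perplexity-based data selection: documents that a language model trained on clean text finds surprising (high perplexity) are dropped. Real pipelines score text with a trained KenLM n-gram model (e.g., `kenlm.Model(...).score(sentence)`); since that requires a model file, this sketch substitutes a hypothetical unigram model estimated from a small clean corpus. All function names below are illustrative, not from any of the listed papers.

```python
import math
from collections import Counter


def train_unigram(clean_corpus):
    """Estimate unigram log-probabilities from a clean reference corpus.

    Stand-in for a real n-gram LM such as KenLM (hypothetical helper)."""
    counts = Counter(tok for doc in clean_corpus for tok in doc.lower().split())
    total = sum(counts.values())
    return {tok: math.log(c / total) for tok, c in counts.items()}


def perplexity(text, logprobs, oov_logprob=-8.0):
    """Per-token perplexity of `text` under the unigram model.

    Out-of-vocabulary tokens get a fixed low log-probability."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    logp = sum(logprobs.get(t, oov_logprob) for t in tokens)
    return math.exp(-logp / len(tokens))


def filter_by_perplexity(docs, logprobs, threshold):
    """Keep only documents the reference model finds unsurprising."""
    return [d for d in docs if perplexity(d, logprobs) < threshold]
```

An ensemble variant, as hinted at in the overview, would score each document under several models (trained on different clean domains) and keep it if any score clears the threshold; the threshold itself is typically tuned per corpus.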
Papers
Understanding Data Importance in Machine Learning Attacks: Does Valuable Data Pose Greater Harm?
Rui Wen, Michael Backes, Yang Zhang
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
Yiping Wang, Yifang Chen, Wendan Yan, Alex Fang, Wenjing Zhou, Kevin Jamieson, Simon Shaolei Du
Can We Enhance the Quality of Mobile Crowdsensing Data Without Ground Truth?
Jiajie Li, Bo Gu, Shimin Gong, Zhou Su, Mohsen Guizani