Synthetic Data
Synthetic data generation creates artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, each tailored to specific data types such as images, text, tabular data, and audio. The field matters across scientific domains and practical applications because it enables the training of robust machine learning models where real data is insufficient or ethically problematic to use.
Papers
Add-SD: Rational Generation without Manual Reference
Lingfeng Yang, Xinyu Zhang, Xiang Li, Jinwen Chen, Kun Yao, Gang Zhang, Errui Ding, Lingqiao Liu, Jingdong Wang, Jian Yang
Federated Knowledge Recycling: Privacy-Preserving Synthetic Data Sharing
Eugenio Lomurno, Matteo Matteucci
SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models
Zheng Liu, Hao Liang, Xijie Huang, Wentao Xiong, Qinhan Yu, Linzhuang Sun, Chong Chen, Conghui He, Bin Cui, Wentao Zhang
Analyzing and reducing the synthetic-to-real transfer gap in Music Information Retrieval: the task of automatic drum transcription
Mickaël Zehren, Marco Alunno, Paolo Bientinesi
Navigating the United States Legislative Landscape on Voice Privacy: Existing Laws, Proposed Bills, Protection for Children, and Synthetic Data for AI
Satwik Dutta, John H. L. Hansen
Self-Directed Synthetic Dialogues and Revisions Technical Report
Nathan Lambert, Hailey Schoelkopf, Aaron Gokaslan, Luca Soldaini, Valentina Pyatkin, Louis Castricato
On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures
Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter