Synthetic Dataset
Synthetic datasets are artificially generated datasets designed to mimic the statistical properties of real-world data. They are used mainly to address data scarcity, privacy concerns, or high annotation costs in machine learning applications. Current research focuses on improving the fidelity and diversity of synthetic data using generative models such as variational autoencoders, generative adversarial networks, and diffusion models. A related line of work, dataset distillation, uses techniques such as knowledge distillation and trajectory matching to compress large datasets into small synthetic ones that are cheaper to train on. Developing and validating high-quality synthetic datasets is crucial for advancing machine learning in fields such as healthcare, robotics, and remote sensing, where acquiring sufficient real data is difficult or ethically problematic.
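As a minimal illustration of what "mimicking the statistical properties of real-world data" can mean, the sketch below fits a two-column Gaussian model to a small stand-in dataset and samples new rows from it. This is a deliberately simple baseline, not one of the generative models named above and not the method of any paper listed on this page. The function names `fit_bivariate_gaussian` and `sample_synthetic`, and the stand-in "real" data, are assumptions made purely for illustration.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, then normalise to a correlation coefficient.
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    rho = cov / (sx * sy)
    return mx, my, sx, sy, rho


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows from the fitted bivariate Gaussian."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into y so the two columns carry the fitted correlation.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows


# Hypothetical stand-in for a real dataset: two correlated numeric columns.
_rng = random.Random(42)
real = []
for _ in range(5000):
    a, b = _rng.gauss(0, 1), _rng.gauss(0, 1)
    real.append((1.0 + a, -2.0 + 1.5 * (0.5 * a + math.sqrt(0.75) * b)))

params = fit_bivariate_gaussian(real)
synthetic = sample_synthetic(params, n=5000)
synthetic_params = fit_bivariate_gaussian(synthetic)
```

Comparing `params` with `synthetic_params` gives a rough fidelity check: the synthetic rows should reproduce the means, spreads, and correlation of the original columns without copying any individual record. Real tabular data is rarely Gaussian, which is one reason the research summarized above turns to more expressive generative models.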
Papers
How Realistic Is Your Synthetic Data? Constraining Deep Generative Models for Tabular Data
Mihaela Cătălina Stoian, Salijona Dyrmishi, Maxime Cordy, Thomas Lukasiewicz, Eleonora Giunchiglia
Group Distributionally Robust Dataset Distillation with Risk Minimization
Saeed Vahidian, Mingyu Wang, Jianyang Gu, Vyacheslav Kungurtsev, Wei Jiang, Yiran Chen
Improved Data Generation for Enhanced Asset Allocation: A Synthetic Dataset Approach for the Fixed Income Universe
Szymon Kubiak, Tillman Weyde, Oleksandr Galkin, Dan Philps, Ram Gopal
Syn3DWound: A Synthetic Dataset for 3D Wound Bed Analysis
Léo Lebrat, Rodrigo Santa Cruz, Remi Chierchia, Yulia Arzhaeva, Mohammad Ali Armin, Joshua Goldsmith, Jeremy Oorloff, Prithvi Reddy, Chuong Nguyen, Lars Petersson, Michelle Barakat-Johnson, Georgina Luscombe, Clinton Fookes, Olivier Salvado, David Ahmedt-Aristizabal