Synthetic Dataset
Synthetic datasets are artificially generated datasets designed to mimic the statistical properties of real-world data. They are used mainly to address data scarcity, privacy concerns, or high annotation costs in machine learning applications. Current research focuses on improving the fidelity and diversity of synthetic data using generative models such as variational autoencoders, generative adversarial networks, and diffusion models, often incorporating techniques such as knowledge distillation and trajectory matching to make generation more efficient and the resulting data more useful for training. Developing and validating high-quality synthetic datasets is crucial for advancing machine learning in fields such as healthcare, robotics, and remote sensing, where acquiring sufficient real data is difficult or ethically problematic.
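To make "mimicking the statistical properties of real-world data" concrete, here is a minimal sketch that uses only the Python standard library. It assumes the real data is a single numeric feature that is roughly Gaussian. The values in `real` and the helper names `fit_gaussian` and `sample_synthetic` are hypothetical, chosen for illustration. A simple parametric fit stands in here for the generative models named above, which are far more expressive.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Estimate the mean and sample standard deviation of the real data."""
    return statistics.fmean(real_values), statistics.stdev(real_values)


def sample_synthetic(mean, std, n, seed=0):
    """Draw n synthetic values from the fitted Gaussian (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


# Stand-in for a small real dataset (hypothetical values, illustration only).
real = [4.9, 5.1, 5.6, 4.7, 5.3, 5.0, 5.8, 4.6, 5.2, 5.4]

mean, std = fit_gaussian(real)
synthetic = sample_synthetic(mean, std, n=5000)

# A basic fidelity check: compare summary statistics of real and synthetic data.
print(f"real:      mean={mean:.2f} std={std:.2f}")
print(f"synthetic: mean={statistics.fmean(synthetic):.2f} "
      f"std={statistics.stdev(synthetic):.2f}")
```

Matching the mean and standard deviation is only the weakest form of fidelity. Realistic validation would also compare joint distributions, performance on downstream tasks, and privacy leakage.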
Papers
Improved Data Generation for Enhanced Asset Allocation: A Synthetic Dataset Approach for the Fixed Income Universe
Szymon Kubiak, Tillman Weyde, Oleksandr Galkin, Dan Philps, Ram Gopal
Syn3DWound: A Synthetic Dataset for 3D Wound Bed Analysis
Léo Lebrat, Rodrigo Santa Cruz, Remi Chierchia, Yulia Arzhaeva, Mohammad Ali Armin, Joshua Goldsmith, Jeremy Oorloff, Prithvi Reddy, Chuong Nguyen, Lars Petersson, Michelle Barakat-Johnson, Georgina Luscombe, Clinton Fookes, Olivier Salvado, David Ahmedt-Aristizabal
Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI for a range of breast characteristics, lesion conspicuities and doses
Elena Sizikova, Niloufar Saharkhiz, Diksha Sharma, Miguel Lago, Berkman Sahiner, Jana G. Delfino, Aldo Badano
TarGEN: Targeted Data Generation with Large Language Models
Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra