Synthetic Dataset
Synthetic datasets are artificially generated datasets designed to mimic the statistical properties of real-world data. They are used mainly to address data scarcity, privacy concerns, or high annotation costs in machine learning applications. Current research focuses on improving the fidelity and diversity of synthetic data with generative models such as variational autoencoders, generative adversarial networks, and diffusion models, often combined with techniques such as knowledge distillation and trajectory matching to improve efficiency and effectiveness. Developing and validating high-quality synthetic datasets is crucial for advancing machine learning in fields such as healthcare, robotics, and remote sensing, where acquiring sufficient real data is difficult or ethically problematic.
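As a toy illustration of what "mimicking the statistical properties" of real data can mean, the sketch below fits a one-dimensional Gaussian to a small sample and draws synthetic values from it. This is a deliberately simplified stand-in for the generative models named above, not any of them. The `real` values, the function names `fit_gaussian` and `sample_synthetic`, and the fixed seed are all assumptions made for the example.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Estimate the mean and sample standard deviation of a 1-D sample."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def sample_synthetic(mean, std, n, seed=0):
    """Draw n synthetic values from a Gaussian with the fitted parameters."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mean, std) for _ in range(n)]


# Hypothetical "real" measurements (illustrative values only).
real = [4.9, 5.1, 5.0, 5.3, 4.7, 5.2, 4.8, 5.0]

mean, std = fit_gaussian(real)
synthetic = sample_synthetic(mean, std, n=1000)

# Fidelity check: the synthetic sample should reproduce the fitted statistics.
print(f"real:      mean={mean:.3f}, std={std:.3f}")
print(f"synthetic: mean={statistics.mean(synthetic):.3f}, "
      f"std={statistics.stdev(synthetic):.3f}")
```

Matching a mean and a standard deviation is the weakest form of fidelity. Real use cases typically also need to preserve correlations between features, class balance, and rare cases, which is where the more expressive generative models come in.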
Papers
Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI for a range of breast characteristics, lesion conspicuities and doses
Elena Sizikova, Niloufar Saharkhiz, Diksha Sharma, Miguel Lago, Berkman Sahiner, Jana G. Delfino, Aldo Badano
TarGEN: Targeted Data Generation with Large Language Models
Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra