Synthetic Data
Synthetic data generation creates artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, tailored to data types such as images, text, tabular data, and audio. The field matters across scientific domains and practical applications because it allows machine learning models to be trained where real data is insufficient or ethically problematic to use, which can improve model performance and broaden the range of feasible research.
Papers
Decentralised, Scalable and Privacy-Preserving Synthetic Data Generation
Vishal Ramesh, Rui Zhao, Naman Goel
MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation
Chandrani Kumari, Rahul Siddharthan
Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data
Mayana Pereira, Meghana Kshirsagar, Sumit Mukherjee, Rahul Dodhia, Juan Lavista Ferres, Rafael de Sousa
FOUND: Foot Optimization with Uncertain Normals for Surface Deformation Using Synthetic Data
Oliver Boyne, Gwangbin Bae, James Charles, Roberto Cipolla
TarGEN: Targeted Data Generation with Large Language Models
Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra
Boosting Data Analytics With Synthetic Volume Expansion
Xiaotong Shen, Yifei Liu, Rex Shen