Synthetic Data Generation
Synthetic data generation creates artificial datasets that mimic the statistical properties of real data, addressing limited data availability, privacy concerns, and the high cost of data annotation. Current research focuses on generative models, including diffusion models, generative adversarial networks (GANs), and methods that leverage large language models, to produce high-fidelity synthetic data across tabular, image, text, and 3D data. The field matters for machine learning because it enables training robust models where real data is scarce, expensive, or sensitive, and it can improve the reliability and fairness of AI systems.
Papers
Decentralised, Scalable and Privacy-Preserving Synthetic Data Generation
Vishal Ramesh, Rui Zhao, Naman Goel
MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation
Chandrani Kumari, Rahul Siddharthan
Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data
Mayana Pereira, Meghana Kshirsagar, Sumit Mukherjee, Rahul Dodhia, Juan Lavista Ferres, Rafael de Sousa
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark
Lasse Hansen, Nabeel Seedat, Mihaela van der Schaar, Andrija Petrovic
UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based Perception
Christopher Maxey, Jaehoon Choi, Hyungtae Lee, Dinesh Manocha, Heesung Kwon
Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation
Tae Jin Park, He Huang, Coleman Hooper, Nithin Koluguri, Kunal Dhawan, Ante Jukic, Jagadeesh Balam, Boris Ginsburg
UNav-Sim: A Visually Realistic Underwater Robotics Simulator and Synthetic Data-generation Framework
Abdelhakim Amer, Olaya Álvarez-Tuñón, Halil Ibrahim Ugurlu, Jonas le Fevre Sejersen, Yury Brodskiy, Erdal Kayacan