Synthetic Data
Synthetic data generation creates artificial datasets that mimic the statistical properties of real-world data, addressing data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, tailored to data types such as images, text, tabular data, and audio. The field matters across scientific domains and practical applications because it allows machine learning models to be trained where real data is insufficient or ethically problematic, which can improve model performance and open up new lines of research.
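As a rough illustration of what "mimicking the statistical properties of real data" can mean, the sketch below fits a mean and standard deviation to each column of a tiny table and samples new rows from those fitted Gaussians. This is a deliberately simple stand-in, not one of the generative models named above and not a method from any of the listed papers. The dataset, column meanings, and function names are hypothetical and chosen only for illustration.

```python
import random
import statistics


def fit_gaussian_columns(rows):
    """Estimate a (mean, standard deviation) pair for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_synthetic(params, n_rows, seed=0):
    """Draw n_rows synthetic rows, sampling each column independently."""
    rng = random.Random(seed)
    return [
        tuple(rng.gauss(mu, sigma) for mu, sigma in params)
        for _ in range(n_rows)
    ]


# Hypothetical "real" data: (height_cm, weight_kg) for a handful of people.
real_rows = [
    (170.0, 65.0),
    (182.0, 80.0),
    (165.0, 58.0),
    (175.0, 72.0),
    (178.0, 77.0),
]

params = fit_gaussian_columns(real_rows)
synthetic_rows = sample_synthetic(params, n_rows=1000)

# Compare per-column statistics of the real and synthetic tables.
synthetic_params = fit_gaussian_columns(synthetic_rows)
for (real_mu, real_sd), (syn_mu, syn_sd) in zip(params, synthetic_params):
    print(
        f"real mean={real_mu:.1f} sd={real_sd:.1f} | "
        f"synthetic mean={syn_mu:.1f} sd={syn_sd:.1f}"
    )
```

Because each column is sampled independently, this sketch reproduces per-column means and spreads but discards any correlation between columns. Capturing that kind of joint structure is one reason richer generative models are used in practice.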
Papers
Synthetic Data in Radiological Imaging: Current State and Future Outlook
Elena Sizikova, Andreu Badal, Jana G. Delfino, Miguel Lago, Brandon Nelson, Niloufar Saharkhiz, Berkman Sahiner, Ghada Zamzmi, Aldo Badano
Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks
Chancellor R. Woolsey, Prakash Bisht, Joshua Rothman, Gondy Leroy
Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets
Leila Nombo, Anne-Sophie Charest
Synthetic Data from Diffusion Models Improve Drug Discovery Prediction
Bing Hu, Ashish Saragadam, Anita Layton, Helen Chen
Differentially Private Synthetic Data with Private Density Estimation
Nikolija Bojkovic, Po-Ling Loh
Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data
Leonhard Hennicke, Christian Medeiros Adriano, Holger Giese, Jan Mathias Koehler, Lukas Schott
Synthetic Face Datasets Generation via Latent Space Exploration from Brownian Identity Diffusion
David Geissbühler, Hatef Otroshi Shahreza, Sébastien Marcel
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
Shivam Mehta, Anna Deichler, Jim O'Regan, Birger Moëll, Jonas Beskow, Gustav Eje Henter, Simon Alexanderson
WheelPose: Data Synthesis Techniques to Improve Pose Estimation Performance on Wheelchair Users
William Huang, Sam Ghahremani, Siyou Pei, Yang Zhang
Auto-Generating Weak Labels for Real & Synthetic Data to Improve Label-Scarce Medical Image Segmentation
Tanvi Deshpande, Eva Prakash, Elsie Gyang Ross, Curtis Langlotz, Andrew Ng, Jeya Maria Jose Valanarasu
Privacy-Preserving Statistical Data Generation: Application to Sepsis Detection
Eric Macias-Fassio, Aythami Morales, Cristina Pruenza, Julian Fierrez
Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data
Niclas Popp, Jan Hendrik Metzen, Matthias Hein
Large Language Models Perform on Par with Experts Identifying Mental Health Factors in Adolescent Online Forums
Isabelle Lorge, Dan W. Joyce, Andrey Kormilitzin