Synthetic Data
Synthetic data generation creates artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, tailored to specific data types (images, text, tabular data, audio). The field matters across scientific domains and practical applications because it enables machine learning models to be trained where real data is insufficient or ethically problematic to use, which can improve model performance and broaden the range of feasible research.
Papers
Towards Biologically Plausible and Private Gene Expression Data Generation
Dingfan Chen, Marie Oestreich, Tejumade Afonja, Raouf Kerkouche, Matthias Becker, Mario Fritz
How Realistic Is Your Synthetic Data? Constraining Deep Generative Models for Tabular Data
Mihaela Cătălina Stoian, Salijona Dyrmishi, Maxime Cordy, Thomas Lukasiewicz, Eleonora Giunchiglia
Group Distributionally Robust Dataset Distillation with Risk Minimization
Saeed Vahidian, Mingyu Wang, Jianyang Gu, Vyacheslav Kungurtsev, Wei Jiang, Yiran Chen
CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines
Chao Pang, Xinzhuo Jiang, Nishanth Parameshwar Pavinkurve, Krishna S. Kalluri, Elise L. Minto, Jason Patterson, Linying Zhang, George Hripcsak, Gamze Gürsoy, Noémie Elhadad, Karthik Natarajan
Bounding the Excess Risk for Linear Models Trained on Marginal-Preserving, Differentially-Private, Synthetic Data
Yvonne Zhou, Mingyu Liang, Ivan Brugere, Dana Dachman-Soled, Danial Dervovic, Antigoni Polychroniadou, Min Wu
A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets
Ossi Räisä, Antti Honkela
SynthDST: Synthetic Data is All You Need for Few-Shot Dialog State Tracking
Atharva Kulkarni, Bo-Hsiang Tseng, Joel Ruben Antony Moniz, Dhivya Piraviperumal, Hong Yu, Shruti Bhargava
SudokuSens: Enhancing Deep Learning Robustness for IoT Sensing Applications using a Generative Approach
Tianshi Wang, Jinyang Li, Ruijie Wang, Denizhan Kara, Shengzhong Liu, Davis Wertheimer, Antoni Viros-i-Martin, Raghu Ganti, Mudhakar Srivatsa, Tarek Abdelzaher
From Synthetic to Real: Unveiling the Power of Synthetic Data for Video Person Re-ID
Xiangqun Zhang, Wei Feng, Ruize Han, Likai Wang, Linqi Song, Junhui Hou
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?
Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem
Synthetic Data for the Mitigation of Demographic Biases in Face Recognition
Pietro Melzi, Christian Rathgeb, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Dominik Lawatsch, Florian Domin, Maxim Schaubert
De-identification is not always enough
Atiquer Rahman Sarkar, Yao-Shun Chuang, Noman Mohammed, Xiaoqian Jiang
VR-based generation of photorealistic synthetic data for training hand-object tracking models
Chengyan Zhang, Rahul Chaudhari
A primer on synthetic health data
Jennifer Anne Bartell, Sander Boisen Valentin, Anders Krogh, Henning Langberg, Martin Bøgsted