Synthetic Data
Synthetic data generation creates artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, each tailored to particular data types (images, text, tabular data, audio). The field matters across scientific domains and practical applications because it makes it possible to train machine learning models where real data is insufficient or ethically problematic to collect or share.
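As a rough illustration of what "mimicking statistical properties" can mean for tabular data, the sketch below fits a two-column Gaussian to a dataset and samples new rows from the fit. It is a minimal baseline, not one of the generative models named above and not a method from any paper listed here. The "real" data is itself simulated as a stand-in, the function names are hypothetical, and a plain Gaussian fit like this offers no privacy guarantee.

```python
import math
import random
import statistics


def fit_gaussian_2d(rows):
    """Estimate the means and the (co)variances of two-column data."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(rows)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return (mx, my), (var_x, cov_xy, var_y)


def sample_gaussian_2d(mean, cov, n, rng):
    """Draw n synthetic rows using a hand-written 2x2 Cholesky factor."""
    (mx, my), (var_x, cov_xy, var_y) = mean, cov
    l11 = math.sqrt(var_x)
    l21 = cov_xy / l11
    l22 = math.sqrt(max(var_y - l21 ** 2, 0.0))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
    return rows


def correlation(rows):
    """Pearson correlation between the two columns."""
    _, (var_x, cov_xy, var_y) = fit_gaussian_2d(rows)
    return cov_xy / math.sqrt(var_x * var_y)


rng = random.Random(0)

# Assumption: a simulated stand-in for a real dataset with two correlated columns.
real = []
for _ in range(2000):
    x = rng.gauss(50, 10)
    y = 0.8 * x + rng.gauss(0, 5)
    real.append((x, y))

# Fit on the "real" rows, then sample a synthetic dataset of the same size.
mean, cov = fit_gaussian_2d(real)
synthetic = sample_gaussian_2d(mean, cov, 2000, rng)

print("real correlation:     ", round(correlation(real), 3))
print("synthetic correlation:", round(correlation(synthetic), 3))
```

The synthetic rows share no individual record with the input, yet their means and correlation should land close to those of the fitted data. Real datasets are rarely this well described by a single Gaussian, which is one reason the field has moved toward the more expressive models listed above.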
Papers
Active Perception using Neural Radiance Fields
Siming He, Christopher D. Hsu, Dexter Ong, Yifei Simon Shao, Pratik Chaudhari
Enhancing ML model accuracy for Digital VLSI circuits using diffusion models: A study on synthetic data generation
Prasha Srivastava, Pawan Kumar, Zia Abbas
Private Synthetic Data Meets Ensemble Learning
Haoyuan Sun, Navid Azizan, Akash Srivastava, Hao Wang
Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations
Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ming Yin
Does Synthetic Data Make Large Language Models More Efficient?
Sia Gholami, Marwan Omar
Deep Aramaic: Towards a Synthetic Data Paradigm Enabling Machine Learning in Epigraphy
Andrei C. Aioanei, Regine Hunziker-Rodewald, Konstantin Klein, Dominik L. Michels
Utilizing Synthetic Data for Medical Vision-Language Pre-training: Bypassing the Need for Real Images
Che Liu, Anand Shah, Wenjia Bai, Rossella Arcucci
Mitigating stereotypical biases in text to image generative systems
Piero Esposito, Parmida Atighehchian, Anastasis Germanidis, Deepti Ghadiyaram
Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method
Rémy Chapelle, Bruno Falissard
Partition-based differentially private synthetic data generation
Meifan Zhang, Dihang Deng, Lihua Yin
How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound
Menghan Yu, Sourabh Kulhare, Courosh Mehanian, Charles B Delahunt, Daniel E Shea, Zohreh Laverriere, Ishan Shah, Matthew P Horning
Can pre-trained models assist in dataset distillation?
Yao Lu, Xuguang Chen, Yuchen Zhang, Jianyang Gu, Tianle Zhang, Yifan Zhang, Xiaoniu Yang, Qi Xuan, Kai Wang, Yang You