Synthetic Data
Synthetic data generation creates artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, tailored to specific data types (images, text, tabular data, audio). The field matters across scientific domains and practical applications because it allows machine learning models to be trained where real data is insufficient or ethically problematic to collect.
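As a rough illustration of what "mimicking the statistical properties" of a dataset can mean, the sketch below fits a two-column Gaussian to a toy table and samples new rows from it. It is a deliberately simple stand-in, not one of the generative models named above and not the method of any paper listed here. The function names (`fit_gaussian_2d`, `sample_synthetic`) and the height/weight toy data are hypothetical and chosen only for illustration.

```python
import math
import random


def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = sum(xs) / n, sum(ys) / n
    # Population (divide-by-n) variances and covariance.
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx, sy = math.sqrt(vx), math.sqrt(vy)
    rho = cxy / (sx * sy)
    return mx, my, sx, sy, rho


def sample_synthetic(params, n, rng):
    """Draw n synthetic rows from the fitted bivariate Gaussian."""
    mx, my, sx, sy, rho = params
    # Guard against tiny negative values caused by floating-point rounding.
    resid = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y reproduces the fitted correlation rho.
        y = my + sy * (rho * z1 + resid * z2)
        rows.append((x, y))
    return rows


# Toy "real" data (an assumption for this sketch): correlated height/weight.
rng = random.Random(0)
real = []
for _ in range(2000):
    height = rng.gauss(170, 10)
    weight = 0.9 * height - 85 + rng.gauss(0, 5)
    real.append((height, weight))

real_params = fit_gaussian_2d(real)
synthetic = sample_synthetic(real_params, 5000, rng)
synthetic_params = fit_gaussian_2d(synthetic)
```

Comparing `real_params` with `synthetic_params` shows whether the means, spreads and correlation carry over to the synthetic rows. A Gaussian fit captures only first- and second-order statistics, and it offers no privacy guarantee by itself, which is part of why the field uses richer models and formal privacy mechanisms.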
Papers
Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations
Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ming Yin
Does Synthetic Data Make Large Language Models More Efficient?
Sia Gholami, Marwan Omar
Deep Aramaic: Towards a Synthetic Data Paradigm Enabling Machine Learning in Epigraphy
Andrei C. Aioanei, Regine Hunziker-Rodewald, Konstantin Klein, Dominik L. Michels
Utilizing Synthetic Data for Medical Vision-Language Pre-training: Bypassing the Need for Real Images
Che Liu, Anand Shah, Wenjia Bai, Rossella Arcucci
Mitigating stereotypical biases in text to image generative systems
Piero Esposito, Parmida Atighehchian, Anastasis Germanidis, Deepti Ghadiyaram
Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method
Rémy Chapelle, Bruno Falissard
Partition-based differentially private synthetic data generation
Meifan Zhang, Dihang Deng, Lihua Yin
How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound
Menghan Yu, Sourabh Kulhare, Courosh Mehanian, Charles B Delahunt, Daniel E Shea, Zohreh Laverriere, Ishan Shah, Matthew P Horning
Can pre-trained models assist in dataset distillation?
Yao Lu, Xuguang Chen, Yuchen Zhang, Jianyang Gu, Tianle Zhang, Yifan Zhang, Xiaoniu Yang, Qi Xuan, Kai Wang, Yang You
Feedback-guided Data Synthesis for Imbalanced Classification
Reyhane Askari Hemmat, Mohammad Pezeshki, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano
Towards Few-Call Model Stealing via Active Self-Paced Knowledge Distillation and Diffusion-Based Image Generation
Vlad Hondru, Radu Tudor Ionescu