Synthetic Data
Synthetic data generation aims to create artificial datasets that mimic the statistical properties of real-world data, addressing limitations like data scarcity, privacy concerns, and high annotation costs. Current research focuses on developing sophisticated generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, tailored to various data types (images, text, tabular data, audio). This rapidly evolving field significantly impacts diverse scientific domains and practical applications by enabling the training of robust machine learning models in situations where real data is insufficient or ethically problematic, ultimately improving model performance and expanding research possibilities.
Papers
A Survey on Data Synthesis and Augmentation for Large Language Models
Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang
Constrained Posterior Sampling: Time Series Generation with Hard Constraints
Sai Shankar Narasimhan, Shubhankar Agarwal, Litu Rout, Sanjay Shakkottai, Sandeep P. Chinchali
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Zhiyuan Zhao, Hengrui Kang, Bin Wang, Conghui He
From Measurement Instruments to Data: Leveraging Theory-Driven Synthetic Training Data for Classifying Social Constructs
Lukas Birkenmaier, Matthias Roth, Indira Sen
Evaluating Utility of Memory Efficient Medical Image Generation: A Study on Lung Nodule Segmentation
Kathrin Khadra, Utku Türkbey
Privacy-Preserving Synthetically Augmented Knowledge Graphs with Semantic Utility
Luigi Bellomarini, Costanza Catalano, Andrea Coletta, Michela Iezzi, Pierangela Samarati
Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection
Yong Xie, Karan Aggarwal, Aitzaz Ahmad, Stephen Lau
Advancing Healthcare: Innovative ML Approaches for Improved Medical Imaging in Data-Constrained Environments
Al Amin, Kamrul Hasan, Saleh Zein-Sabatto, Liang Hong, Sachin Shetty, Imtiaz Ahmed, Tariqul Islam
SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data
Xilin He, Cheng Luo, Xiaole Xian, Bing Li, Siyang Song, Muhammad Haris Khan, Weicheng Xie, Linlin Shen, Zongyuan Ge
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, Zhizheng Wu, Yiping Chen, Dahua Lin, Conghui He, Weijia Li
Hybrid Training Approaches for LLMs: Leveraging Real and Synthetic Data to Enhance Model Performance in Domain-Specific Applications
Alexey Zhezherau, Alexei Yanockin
Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory
Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda Alami, Ahmed Alzubaidi, Hakim Hacid
Driving Privacy Forward: Mitigating Information Leakage within Smart Vehicles through Synthetic Data Generation
Krish Parikh
Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains
Krithika Ramesh, Nupoor Gandhi, Pulkit Madaan, Lisa Bauer, Charith Peris, Anjalie Field
Data Augmentation for Surgical Scene Segmentation with Anatomy-Aware Diffusion Models
Danush Kumar Venkatesh, Dominik Rivoir, Micha Pfeiffer, Fiona Kolbinger, Stefanie Speidel