Synthetic Data
Synthetic data generation aims to create artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on developing sophisticated generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, each tailored to particular data types (images, text, tabular data, audio). This rapidly evolving field impacts diverse scientific domains and practical applications by enabling the training of robust machine learning models where real data is insufficient or ethically problematic, improving model performance and expanding research possibilities.
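The core idea of "mimicking the statistical properties of real data" can be sketched in a few lines: fit a simple parametric model to real records and sample new ones from it. The snippet below uses a multivariate Gaussian purely as a stand-in for the richer generators surveyed above (GANs, diffusion models, masked language models); the dataset is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" dataset: 1000 records, 3 correlated numeric columns.
real = rng.multivariate_normal(
    mean=[10.0, 5.0, 0.0],
    cov=[[2.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 0.5]],
    size=1000,
)

# "Fit": estimate the mean and covariance from the real data.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# "Generate": draw synthetic records from the fitted distribution.
# Each synthetic row is a plausible record, but none is a real one.
synthetic = rng.multivariate_normal(mu, sigma, size=1000)

# The synthetic sample reproduces the real data's first two moments,
# so a model trained on it sees similar statistics.
print(synthetic.shape)
```

Real generators replace the Gaussian with a learned neural model, and privacy-focused variants (several papers below) additionally bound how much any single real record can influence the samples.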
Papers
Synth It Like KITTI: Synthetic Data Generation for Object Detection in Driving Scenarios
CLIPPER: Compression enables long-context synthetic data generation
PREM: Privately Answering Statistical Queries with Relative Error
Data-Constrained Synthesis of Training Data for De-Identification
Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation
Private Text Generation by Seeding Large Language Model Prompts
Does Training with Synthetic Data Truly Protect Privacy?
Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection
Frequency-domain alignment of heterogeneous, multidimensional separations data through complex orthogonal Procrustes analysis
LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data
Zero-shot generation of synthetic neurosurgical data with large language models
DiffRenderGAN: Addressing Training Data Scarcity in Deep Segmentation Networks for Quantitative Nanomaterial Analysis through Differentiable Rendering and Generative Modelling
Escaping Collapse: The Strength of Weak Data for Large Language Model Training