Synthetic Tabular Data Generation

Synthetic tabular data generation creates artificial datasets that statistically resemble real data, addressing data scarcity, privacy concerns, and class imbalance. Current research focuses on improving the fidelity of generated data, particularly by better preserving dependencies between attributes and by handling mixed data types. Commonly used models include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion probabilistic models, and researchers are also exploring the potential and limitations of large language models for this task. The field is significant because high-quality synthetic data can enable responsible data sharing, augment existing datasets to improve machine learning performance, and facilitate research in areas with limited access to real data.
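
To illustrate why preserving dependencies between attributes matters for fidelity, here is a minimal sketch using only the Python standard library. The toy table (`ages`, `incomes`), the noise level, and the column-shuffling "generator" are all assumptions made for illustration; this is not a method from any of the papers in this area. Shuffling each column independently keeps every per-column distribution intact but removes the relationship between columns, which is the failure mode that dependency-aware generators try to avoid.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical "real" table: income depends on age plus Gaussian noise.
ages = [rng.uniform(20, 65) for _ in range(2000)]
incomes = [1000 * age + rng.gauss(0, 5000) for age in ages]

# Naive "synthetic" table: shuffle each column independently.
# Marginal distributions are unchanged, but the rows no longer line up.
synthetic_ages = ages[:]
synthetic_incomes = incomes[:]
rng.shuffle(synthetic_ages)
rng.shuffle(synthetic_incomes)

real_corr = pearson(ages, incomes)
synthetic_corr = pearson(synthetic_ages, synthetic_incomes)

print(f"real correlation:      {real_corr:.3f}")
print(f"synthetic correlation: {synthetic_corr:.3f}")
```

In this sketch the real columns are strongly correlated while the shuffled columns are close to uncorrelated, even though each synthetic column contains exactly the same values as its real counterpart. A fidelity check that only compares per-column statistics would therefore miss the difference.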

Papers