Synthetic Tabular Data
Synthetic tabular data generation creates artificial datasets that mimic the statistical properties of real data while addressing data scarcity, privacy concerns, and bias. Current research focuses on improving the fidelity and utility of synthetic data with generative models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), diffusion models, and, increasingly, large language models (LLMs). These models are often combined with techniques such as transfer learning and conditional generation to enhance realism and preserve complex relationships between features. High-quality synthetic data matters because it can enable broader data sharing, augment limited datasets for machine learning model training, and facilitate research in sensitive domains where access to real data is restricted.
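To make "mimicking statistical properties" concrete, here is a minimal sketch that uses only the Python standard library. It is not any of the methods named above: a fitted bivariate Gaussian stands in for the generative model, and the "real" table (an age column and an income column) is toy data invented for illustration. The function names `fit_bivariate_gaussian`, `sample_synthetic`, and `fidelity_report` are hypothetical. The sketch fits the model, samples a synthetic table, and reports how far the synthetic column means and the inter-column correlation drift from the real ones.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    xs, ys = zip(*rows)
    return {
        "mean": (statistics.fmean(xs), statistics.fmean(ys)),
        "std": (statistics.pstdev(xs), statistics.pstdev(ys)),
        "rho": pearson(xs, ys),
    }


def sample_synthetic(params, n, rng):
    """Draw n synthetic rows from the fitted bivariate Gaussian."""
    (mx, my), (sx, sy), rho = params["mean"], params["std"], params["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into the second column so the pair has correlation rho.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows


def fidelity_report(real, synthetic):
    """Absolute gaps in column means and in correlation; smaller is closer."""
    r, s = fit_bivariate_gaussian(real), fit_bivariate_gaussian(synthetic)
    return {
        "mean_gap": tuple(abs(a - b) for a, b in zip(r["mean"], s["mean"])),
        "rho_gap": abs(r["rho"] - s["rho"]),
    }


rng = random.Random(0)
# Toy "real" table (invented): income depends loosely on age.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 1000 * age + rng.gauss(0, 8000)))

synthetic = sample_synthetic(fit_bivariate_gaussian(real), 2000, rng)
report = fidelity_report(real, synthetic)
print(report)
```

A Gaussian baseline like this captures only means, variances, and linear correlation. Mixed categorical and numeric columns, skewed marginals, and nonlinear dependencies are the cases where the GAN, VAE, diffusion, and LLM approaches listed above are typically applied. Utility is usually assessed separately, for example by training a model on the synthetic table and evaluating it on held-out real data.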
Papers
CTG-KrEW: Generating Synthetic Structured Contextually Correlated Content by Conditional Tabular GAN with K-Means Clustering and Efficient Word Embedding
Riya Samanta, Bidyut Saha, Soumya K. Ghosh, Sajal K. Das
EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding
Muye Huang, Lai Han, Xinyu Zhang, Wenjun Wu, Jie Ma, Lingling Zhang, Jun Liu