Tabular Data Synthesis
Tabular data synthesis aims to generate realistic synthetic datasets that preserve the statistical properties of real data while protecting privacy. Current research focuses on improving the quality and utility of synthetic data using generative models, including Generative Adversarial Networks (GANs), diffusion models, and, increasingly, large language models (LLMs), often combined with techniques such as conditional generation and differential privacy. The field enables data sharing and analysis in sensitive domains while mitigating privacy risks, with applications ranging from healthcare and finance to scientific research. A key challenge remains balancing the fidelity of synthetic data against its privacy protection: the more closely synthetic records track the real ones, the more they can reveal about them.
Papers
Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities
Maria F. Davila R., Sven Groen, Fabian Panse, Wolfram Wingerath
Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis
Seunghwan An, Gyeongdong Woo, Jaesung Lim, ChangHyun Kim, Sungchul Hong, Jong-June Jeon