Synthetic Data
Synthetic data generation aims to create artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, tailored to specific data types (images, text, tabular data, audio). The field matters across scientific domains and practical applications because it enables machine learning models to be trained where real data is insufficient or ethically problematic to use, which can improve model performance and expand research possibilities.
Papers
Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang
Hi5: 2D Hand Pose Estimation with Zero Human Annotation
Masum Hasan, Cengiz Ozel, Nina Long, Alexander Martin, Samuel Potter, Tariq Adnan, Sangwu Lee, Amir Zadeh, Ehsan Hoque
PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs
Charlie Hou, Akshat Shrivastava, Hongyuan Zhan, Rylan Conway, Trang Le, Adithya Sagar, Giulia Fanti, Daniel Lazar
Synthetic Data Outliers: Navigating Identity Disclosure
Carolina Trindade, Luís Antunes, Tânia Carvalho, Nuno Moniz
Meta-Designing Quantum Experiments with Language Models
Sören Arlt, Haonan Duan, Felix Li, Sang Michael Xie, Yuhuai Wu, Mario Krenn
PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection
Ronghui Xu, Hao Miao, Senzhang Wang, Philip S. Yu, Jianxin Wang
CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting
Jianrong Ding, Zhanyu Liu, Guanjie Zheng, Haiming Jin, Linghe Kong
Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data
Haolong Li, Yu Ma, Yinqi Zhang, Chen Ye, Jie Chen
Enhancing Clinical Documentation with Synthetic Data: Leveraging Generative Models for Improved Accuracy
Anjanava Biswas, Wrick Talukdar
An expert-driven data generation pipeline for histological images
Roberto Basla, Loris Giulivi, Luca Magri, Giacomo Boracchi
Visual Car Brand Classification by Implementing a Synthetic Image Dataset Creation Pipeline
Jan Lippemeier, Stefanie Hittmeyer, Oliver Niehörster, Markus Lange-Hegermann
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities
Maria F. Davila R., Sven Groen, Fabian Panse, Wolfram Wingerath
Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis
Seunghwan An, Gyeongdong Woo, Jaesung Lim, ChangHyun Kim, Sungchul Hong, Jong-June Jeon
Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images
Krishnakant Singh, Thanush Navaratnam, Jannik Holmer, Simone Schaub-Meyer, Stefan Roth
Improving Object Detector Training on Synthetic Data by Starting With a Strong Baseline Methodology
Frank A. Ruis, Alma M. Liezenga, Friso G. Heslinga, Luca Ballan, Thijs A. Eker, Richard J. M. den Hollander, Martin C. van Leeuwen, Judith Dijk, Wyke Huizinga