Synthetic Data
Synthetic data generation creates artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, tailored to data types such as images, text, tabular data, and audio. The field affects many scientific domains and practical applications because it enables the training of robust machine learning models where real data is insufficient or ethically problematic to use.
Papers
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai
Enhancing Indoor Temperature Forecasting through Synthetic Data in Low-Data Environments
Zachari Thiry, Massimiliano Ruocco, Alessandro Nocente, Michail Spitieris
Mitigating Bias in Dataset Distillation
Justin Cui, Ruochen Wang, Yuanhao Xiong, Cho-Jui Hsieh
What is Dataset Distillation Learning?
William Yang, Ye Zhu, Zhiwei Deng, Olga Russakovsky
Lean Workbook: A large-scale Lean problem set formalized from natural language math problems
Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, Kai Chen
Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang
Hi5: 2D Hand Pose Estimation with Zero Human Annotation
Masum Hasan, Cengiz Ozel, Nina Long, Alexander Martin, Samuel Potter, Tariq Adnan, Sangwu Lee, Amir Zadeh, Ehsan Hoque
PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs
Charlie Hou, Akshat Shrivastava, Hongyuan Zhan, Rylan Conway, Trang Le, Adithya Sagar, Giulia Fanti, Daniel Lazar
Synthetic Data Outliers: Navigating Identity Disclosure
Carolina Trindade, Luís Antunes, Tânia Carvalho, Nuno Moniz
Meta-Designing Quantum Experiments with Language Models
Sören Arlt, Haonan Duan, Felix Li, Sang Michael Xie, Yuhuai Wu, Mario Krenn
PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection
Ronghui Xu, Hao Miao, Senzhang Wang, Philip S. Yu, Jianxin Wang
CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting
Jianrong Ding, Zhanyu Liu, Guanjie Zheng, Haiming Jin, Linghe Kong
Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data
Haolong Li, Yu Ma, Yinqi Zhang, Chen Ye, Jie Chen
Enhancing Clinical Documentation with Synthetic Data: Leveraging Generative Models for Improved Accuracy
Anjanava Biswas, Wrick Talukdar
An expert-driven data generation pipeline for histological images
Roberto Basla, Loris Giulivi, Luca Magri, Giacomo Boracchi
Visual Car Brand Classification by Implementing a Synthetic Image Dataset Creation Pipeline
Jan Lippemeier, Stefanie Hittmeyer, Oliver Niehörster, Markus Lange-Hegermann
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities
Maria F. Davila R., Sven Groen, Fabian Panse, Wolfram Wingerath
Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis
Seunghwan An, Gyeongdong Woo, Jaesung Lim, ChangHyun Kim, Sungchul Hong, Jong-June Jeon