Synthetic Data
Synthetic data generation creates artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, each tailored to particular data types (images, text, tabular data, audio). The field matters across scientific domains and practical applications because it enables training machine learning models where real data is insufficient or ethically problematic to use, which can improve model performance and open up research that would otherwise be infeasible.
Papers
Towards Foundation Time Series Model: To Synthesize Or Not To Synthesize?
Kseniia Kuvshinova, Olga Tsymboi, Alina Kostromina, Dmitry Simakov, Elizaveta Kovtun
Views Are My Own, but Also Yours: Benchmarking Theory of Mind Using Common Ground
Adil Soubki, John Murzaku, Arash Yousefi Jordehi, Peter Zeng, Magdalena Markowska, Seyed Abolghasem Mirroshandel, Owen Rambow
Differentially Private Synthetic Data via Foundation Model APIs 2: Text
Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin
A synthetic data approach for domain generalization of NLI models
Mohammad Javad Hosseini, Andrey Petrov, Alex Fabrikant, Annie Louis
Synthetic location trajectory generation using categorical diffusion models
Simon Dirmeier, Ye Hong, Fernando Perez-Cruz
Towards Theoretical Understandings of Self-Consuming Generative Models
Shi Fu, Sen Zhang, Yingjie Wang, Xinmei Tian, Dacheng Tao
Online Differentially Private Synthetic Data Generation
Yiyun He, Roman Vershynin, Yizhe Zhu
Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs
Víctor Gallego
Detecting the Clinical Features of Difficult-to-Treat Depression using Synthetic Data from Large Language Models
Isabelle Lorge, Dan W. Joyce, Niall Taylor, Alejo Nevado-Holgado, Andrea Cipriani, Andrey Kormilitzin