Synthetic Data
Synthetic data generation aims to create artificial datasets that mimic the statistical properties of real-world data, addressing limitations like data scarcity, privacy concerns, and high annotation costs. Current research focuses on developing sophisticated generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, tailored to various data types (images, text, tabular data, audio). This rapidly evolving field significantly impacts diverse scientific domains and practical applications by enabling the training of robust machine learning models in situations where real data is insufficient or ethically problematic, ultimately improving model performance and expanding research possibilities.
Papers
Many-to-many Image Generation with Auto-regressive Diffusion Models
Ying Shen, Yizhe Zhang, Shuangfei Zhai, Lifu Huang, Joshua M. Susskind, Jiatao Gu
ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model
Osvaldo Luamba Quinjica, David Ifeoluwa Adelani
Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data
Parth Patwa, Simone Filice, Zhiyu Chen, Giuseppe Castellucci, Oleg Rokhlenko, Shervin Malmasi
Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization
Yuhang Li, Xin Dong, Chen Chen, Jingtao Li, Yuxin Wen, Michael Spranger, Lingjuan Lyu
Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data
Shan Chen, Jack Gallifant, Marco Guevara, Yanjun Gao, Majid Afshar, Timothy Miller, Dmitriy Dligach, Danielle S. Bitterman
Debiasing Cardiac Imaging with Controlled Latent Diffusion Models
Grzegorz Skorupko, Richard Osuala, Zuzanna Szafranowska, Kaisar Kushibar, Nay Aung, Steffen E Petersen, Karim Lekadir, Polyxeni Gkontra
InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection
Thales Bertaglia, Lily Heisig, Rishabh Kaushal, Adriana Iamnitchi
SYNCS: Synthetic Data and Contrastive Self-Supervised Training for Central Sulcus Segmentation
Vladyslav Zalevskyi, Kristoffer Hougaard Madsen
Improving cross-domain brain tissue segmentation in fetal MRI with synthetic data
Vladyslav Zalevskyi, Thomas Sanchez, Margaux Roulet, Jordina Aviles Verddera, Jana Hutter, Hamza Kebiri, Meritxell Bach Cuadra