Synthetic Data
Synthetic data generation creates artificial datasets that mimic the statistical properties of real-world data, addressing limitations such as data scarcity, privacy concerns, and high annotation costs. Current research focuses on generative models, including generative adversarial networks (GANs), energy-based models (EBMs), diffusion models, and masked language models, each tailored to particular data types (images, text, tabular data, audio). The field matters across scientific domains and practical applications because it enables the training of robust machine learning models in situations where real data is insufficient or ethically problematic to use.
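As a rough illustration of what "mimicking statistical properties" can mean for tabular data, the sketch below fits a multivariate Gaussian to a small dataset and samples new rows from it. This is a deliberately simple stand-in, not the method of any paper listed here; the models named above (GANs, diffusion models, and so on) are far more expressive. The function names `fit_gaussian` and `sample_synthetic` and the toy "real" dataset are assumptions made for the example.

```python
import numpy as np

def fit_gaussian(real):
    """Estimate the mean vector and covariance matrix of the real data."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)  # rows are samples, columns are features
    return mean, cov

def sample_synthetic(mean, cov, n_samples, seed=0):
    """Draw synthetic rows from the fitted Gaussian."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Hypothetical stand-in for a real dataset: 500 rows, 2 correlated features.
rng = np.random.default_rng(42)
real = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.8], [0.8, 2.0]], size=500)

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_samples=5000)

# The synthetic rows are new samples, but their first- and second-order
# statistics should be close to those estimated from the real data.
print("real mean:     ", mean)
print("synthetic mean:", synthetic.mean(axis=0))
```

A Gaussian fit preserves only means and covariances; higher-order structure, outliers, and privacy guarantees are not addressed by this sketch.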
Papers
FISHing in Uncertainty: Synthetic Contrastive Learning for Genetic Aberration Detection
Simon Gutwein, Martin Kampel, Sabine Taschner-Mandl, Roxane Licandro
Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis
Linda Zeng
Generative AI-based Pipeline Architecture for Increasing Training Efficiency in Intelligent Weed Control Systems
Sourav Modak, Anthony Stein
Directional anomaly detection
Oliver Urs Lenz, Matthijs van Leeuwen
Private Synthetic Text Generation with Diffusion Models
Sebastian Ochs, Ivan Habernal
Augmenting Polish Automatic Speech Recognition System With Synthetic Data
Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz
Universality of the $\pi^2/6$ Pathway in Avoiding Model Collapse
Apratim Dey, David Donoho
Analysis of Classifier Training on Synthetic Data for Cross-Domain Datasets
Andoni Cortés, Clemente Rodríguez, Gorka Velez, Javier Barandiarán, Marcos Nieto
Evaluating utility in synthetic banking microdata applications
Hugo E. Caceres, Ben Moews
Generating Realistic Tabular Data with Large Language Models
Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Svetha Venkatesh
Sliced-Wasserstein-based Anomaly Detection and Open Dataset for Localized Critical Peak Rebates
Julien Pallage, Bertrand Scherrer, Salma Naccache, Christophe Bélanger, Antoine Lesage-Landry
Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification
Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng
Synthetica: Large Scale Synthetic Data for Robot Perception
Ritvik Singh, Jingzhou Liu, Karl Van Wyk, Yu-Wei Chao, Jean-Francois Lafleche, Florian Shkurti, Nathan Ratliff, Ankur Handa
zGAN: An Outlier-focused Generative Adversarial Network For Realistic Synthetic Data Generation
Azizjon Azimi, Bonu Boboeva, Ilyas Varshavskiy, Shuhrat Khalilbekov, Akhlitdin Nizamitdinov, Najima Noyoftova, Sergey Shulgin
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training
Michael Pieler, Marco Bellagente, Hannah Teufel, Duy Phung, Nathan Cooper, Jonathan Tow, Paulo Rocha, Reshinth Adithyan, Zaid Alyafeai, Nikhil Pinnaparaju, Maksym Zhuravinskyi, Carlos Riquelme