Synthetic Dataset
Synthetic datasets are artificially generated datasets designed to mimic the statistical properties of real-world data, primarily to address data scarcity, privacy concerns, or high annotation costs in machine learning applications. Current research focuses on improving the fidelity and diversity of synthetic data using generative models such as variational autoencoders, generative adversarial networks, and diffusion models, often incorporating techniques such as knowledge distillation and trajectory matching to make generation more efficient and the resulting data more effective. Developing and validating high-quality synthetic datasets is crucial for advancing machine learning in fields such as healthcare, robotics, and remote sensing, where acquiring sufficient real data is challenging or ethically problematic.
Papers
Enhancing Object Detection Accuracy in Autonomous Vehicles Using Synthetic Data
Sergei Voronin, Abubakar Siddique, Muhammad Iqbal
Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai
Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, Peerat Limkonchotiwat