Dataset Synthesis

Dataset synthesis creates artificial datasets to augment or replace real-world data for training machine learning models, addressing challenges such as data scarcity, cost, and annotation effort. Current research explores several synthesis methods: large language models for generating text data, diffusion models for generating images, and retrieval augmentation or contrastive learning to improve the quality and diversity of the synthesized data. The field matters most in data-limited domains, where it can make model training more efficient and may enable applications in areas such as medical imaging, manufacturing process optimization, and code model fine-tuning.

Papers