Training Data Generator
Training data generators use large language models (LLMs) to create synthetic datasets for machine learning tasks, with the goals of improving model performance and addressing data limitations such as class imbalance and spurious correlations. Current research focuses on tailoring data generation to specific model needs (e.g., knowledge distillation, few-shot learning), incorporating teacher feedback or user preferences to enhance data quality and relevance, and mitigating biases inherent in LLMs. These advances have significant implications for improving model generalization, robustness, and efficiency across diverse applications, particularly in resource-constrained or privacy-sensitive settings.
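To make the pattern concrete, below is a minimal Python sketch of the prompt-generate-filter loop such generators typically follow, assuming a binary sentiment-classification task. The task, labels, prompt template, and the `call_llm` stub are all illustrative assumptions, not code from any of the surveyed papers; in practice `call_llm` would wrap a real LLM client.

```python
import json
import random

# Hypothetical task and labels for illustration: binary sentiment classification.
LABELS = ["positive", "negative"]

PROMPT_TEMPLATE = (
    "Write one short product review expressing a {label} sentiment. "
    "Return only the review text."
)


def call_llm(prompt: str) -> str:
    """Toy stand-in for a real LLM client (hypothetical, not a real API).

    In practice, replace this body with a call to your provider's
    chat-completion endpoint, passing the prompt unchanged.
    """
    label = "positive" if "positive" in prompt else "negative"
    return f"Demo {label} review #{random.randint(0, 99999)}."


def generate_synthetic_dataset(n_per_label: int) -> list[dict]:
    """Prompt the LLM once per example, balancing classes up front to
    counter the class-imbalance problem mentioned above."""
    dataset: list[dict] = []
    for label in LABELS:
        for _ in range(n_per_label):
            text = call_llm(PROMPT_TEMPLATE.format(label=label)).strip()
            # Crude quality filter: drop empty or exact-duplicate outputs.
            if text and all(text != ex["text"] for ex in dataset):
                dataset.append({"text": text, "label": label})
    random.shuffle(dataset)
    return dataset


if __name__ == "__main__":
    data = generate_synthetic_dataset(n_per_label=5)
    with open("synthetic_train.jsonl", "w") as f:
        for ex in data:
            f.write(json.dumps(ex) + "\n")
    print(f"wrote {len(data)} synthetic examples")
```

Real systems typically replace the exact-duplicate check with stronger quality controls, such as teacher-model scoring or diversity constraints, which is where the teacher-feedback ideas mentioned above come in.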