Human-Generated Data

Human-generated data is crucial for training machine learning models, particularly large language models (LLMs), but its limitations, including bias, scarcity in specific domains, and high annotation costs, drive current research. Active work focuses on generating high-quality synthetic data, assessing and mitigating biases in existing datasets and models, and exploring alternative sources such as robot-collected or game-generated data to supplement or replace human-annotated data. This research aims to improve the accuracy, fairness, and scalability of AI systems in applications ranging from medical diagnosis to content generation.

Papers