Unsupervised Data Selection

Unsupervised data selection aims to identify the most informative subset of unlabeled data for training machine learning models, particularly in low-resource scenarios where labeled data is scarce. Current research focuses on developing effective selection criteria, often based on metrics such as perplexity, contrastive loss ratios, or divergence measures between data distributions, and on algorithms that use these metrics to guide the selection process. By enabling the effective use of readily available unlabeled data, this work improves the efficiency and performance of machine learning applications such as speech recognition, machine translation, and text-to-speech systems.
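To make the idea of a contrastive selection criterion concrete, here is a minimal sketch, not the method of any particular paper. It assumes a toy add-one-smoothed unigram language model and scores each candidate sentence by the difference between its per-token cross-entropy (log perplexity) under a small target-domain model and under a general-domain model. The function names (`train_unigram`, `cross_entropy`, `select_in_domain`) and the seed corpora are hypothetical and chosen only for illustration; real systems would typically use much stronger models.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Fit an add-one-smoothed unigram model over a fixed vocabulary (toy assumption)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(sentence, model):
    """Average negative log-probability per token, i.e. the log of the perplexity."""
    tokens = sentence.split()
    return -sum(math.log(model[t]) for t in tokens) / len(tokens)


def select_in_domain(pool, target_seed, general_seed, k):
    """Return the k pool sentences that look most like the target domain.

    Score = cross-entropy under the target model minus cross-entropy under
    the general model; lower means "more target-like than general-like".
    """
    pool = [s for s in pool if s.split()]  # skip empty candidates
    # Shared vocabulary so that both models assign every token a probability.
    vocab = {t for s in pool + target_seed + general_seed for t in s.split()}
    target_lm = train_unigram(target_seed, vocab)
    general_lm = train_unigram(general_seed, vocab)
    return sorted(
        pool,
        key=lambda s: cross_entropy(s, target_lm) - cross_entropy(s, general_lm),
    )[:k]


if __name__ == "__main__":
    # Hypothetical seed data: a voice-assistant target domain vs. general news text.
    target_seed = ["turn on the lights", "play some music", "set a timer"]
    general_seed = [
        "the stock market fell today",
        "rain is expected tomorrow",
        "the team won the match",
    ]
    pool = [
        "turn off the music",
        "the market rose today",
        "set the lights timer",
        "rain fell on the match",
    ]
    for sentence in select_in_domain(pool, target_seed, general_seed, k=2):
        print(sentence)
```

No labels are needed at any point: the only supervision is a small sample of target-domain text, which is why criteria of this kind suit low-resource settings. Swapping the score for plain target-model perplexity, or for a divergence between the token distributions of a candidate subset and the target sample, would give the other families of criteria mentioned above.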

Papers