Dialect Gap

The "dialect gap" is the disparity in performance of natural language processing (NLP) models across language varieties, which is most pronounced for dialects underrepresented in training data. Current research quantifies this gap across languages and NLP tasks such as machine translation and speech recognition, often using large language models (LLMs), and explores methods such as synthetic data generation and model merging to improve performance on under-resourced dialects. Closing the gap is crucial for equitable access to NLP technologies and for mitigating bias, including the risk that AI systems make prejudiced decisions based on a speaker's dialect.
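As a rough illustration of what quantifying the gap can look like, the sketch below compares a per-variety score against a reference variety. It assumes a higher-is-better metric (for example accuracy or BLEU) and a designated reference variety. The function name `dialect_gap`, the variety labels, and the scores are all hypothetical and chosen only for illustration; individual papers may define the gap differently.

```python
def dialect_gap(scores_by_variety, reference):
    """Return (absolute, relative) gaps of each variety versus the reference.

    Assumes a higher-is-better metric, so a positive gap means the model
    performs worse on that variety than on the reference variety.
    """
    ref_score = scores_by_variety[reference]
    gaps = {}
    for variety, score in scores_by_variety.items():
        if variety == reference:
            continue
        absolute = ref_score - score
        # The relative gap is undefined when the reference score is zero.
        relative = absolute / ref_score if ref_score else float("nan")
        gaps[variety] = (absolute, relative)
    return gaps


# Hypothetical scores, not taken from any paper.
scores = {"standard": 0.90, "dialect_a": 0.72, "dialect_b": 0.81}
gaps = dialect_gap(scores, reference="standard")

for variety, (absolute, relative) in gaps.items():
    print(f"{variety}: absolute gap {absolute:.2f}, relative gap {relative:.0%}")
```

Reporting both an absolute and a relative gap is one possible design choice: the relative figure makes gaps easier to compare across tasks whose metrics sit on different scales.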

Papers