Under-Represented Languages
Research on under-represented languages focuses on extending the capabilities of natural language processing (NLP) models, particularly large language models (LLMs), to the world's diverse languages, especially those historically under-represented in digital data. Current work emphasizes data-efficient methods for training and adapting models to low-resource languages, often through cross-lingual transfer learning, data augmentation, and multilingual architectures such as multilingual BERT (mBERT) and ByT5. This work promotes linguistic diversity and inclusivity in NLP, broadening access to language technology and supporting its equitable development across communities.
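As a rough illustration of the adaptation techniques mentioned above, the sketch below continues masked-language-model pretraining of multilingual BERT on a small monolingual corpus in a low-resource language, using the Hugging Face Transformers library. The corpus file name is hypothetical and the hyperparameters are illustrative; none of this is drawn from the papers listed below.

```python
# Minimal sketch: adapting a pretrained multilingual encoder (mBERT) to a
# low-resource language via continued masked-language-model pretraining.
# Assumes `transformers`, `datasets`, and `torch` are installed, and that
# `gitksan_corpus.txt` (a hypothetical file) holds one sentence per line
# in the target language.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load the small monolingual corpus and tokenize it.
raw = load_dataset("text", data_files={"train": "gitksan_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking: 15% of tokens are masked for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mbert-adapted",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

After adaptation, the same checkpoint can be fine-tuned on whatever labeled data exists for the target language, which is the usual cross-lingual transfer recipe in low-resource settings.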
Papers
Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages
Clarissa Forbes, Farhan Samir, Bruce Harold Oliver, Changbing Yang, Edith Coates, Garrett Nicolai, Miikka Silfverberg
Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation
Xinyi Wang, Sebastian Ruder, Graham Neubig