Multilingual Corpus

Multilingual corpora, collections of text and speech data spanning multiple languages, are crucial for developing language technologies that work across linguistic boundaries. Current research focuses on creating and improving these corpora, addressing issues like data imbalance, bias detection, and efficient cross-lingual transfer learning using techniques such as deep learning models (e.g., BERT, mT5) and contrastive learning. These advancements are vital for bridging the language gap in natural language processing, enabling applications like multilingual machine translation, speech recognition, and information retrieval to serve a wider global population and fostering research into under-resourced languages.

Papers