Parallel Corpus

Parallel corpora are collections of texts in two or more languages, aligned at the sentence or phrase level. They are crucial resources for training and evaluating machine translation (MT) systems and for other multilingual natural language processing (NLP) tasks. Current research focuses on improving both the quality and the quantity of parallel corpora, through methods for data augmentation, domain-specific corpus creation, and filtering of noisy data, often using techniques such as masked language models and sentence embeddings. Because the availability and quality of parallel corpora strongly affect the performance of multilingual NLP models, particularly for low-resource languages, these resources are essential for advancing both research and practical applications such as cross-lingual communication and information access.
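
As a rough illustration of embedding-based filtering, the sketch below keeps a sentence pair only when the cosine similarity of its source and target embeddings reaches a threshold. The vectors are hand-written toy values standing in for the output of a multilingual sentence encoder, and the names `cosine`, `filter_pairs`, and the 0.8 threshold are assumptions made for this example, not part of any particular method or library.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, threshold=0.8):
    """Keep (source, target) pairs whose embeddings are similar enough.

    `pairs` is a list of (source, target, source_vec, target_vec) tuples.
    The threshold is a hypothetical value; in practice it would be tuned.
    """
    return [(s, t) for s, t, sv, tv in pairs if cosine(sv, tv) >= threshold]


# Toy embeddings (assumed, not produced by a real encoder).
toy_pairs = [
    ("The cat sleeps.", "Le chat dort.", [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    ("Click here to subscribe.", "Il pleut aujourd'hui.", [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]),
]

kept = filter_pairs(toy_pairs)
print(kept)
```

With these toy vectors the first pair is kept and the mismatched second pair is dropped. A real pipeline would obtain the vectors from a multilingual sentence encoder and would typically combine this score with other signals such as length ratio or language identification.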

Papers