Parallel Corpus
Parallel corpora are collections of texts in two or more languages aligned at the sentence or phrase level. They are crucial resources for training and evaluating machine translation (MT) systems and other multilingual natural language processing (NLP) tasks. Current research focuses on improving both the quality and the quantity of parallel corpora, including methods for data augmentation, domain-specific corpus creation, and filtering of noisy sentence pairs, often leveraging techniques such as masked language models and multilingual sentence embeddings. The availability and quality of parallel corpora strongly affect the performance of multilingual NLP models, particularly for low-resource languages, making them essential for both research and practical applications such as cross-lingual communication and information access.
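The filtering step mentioned above is commonly implemented by scoring candidate sentence pairs with multilingual sentence embeddings and discarding pairs whose similarity falls below a threshold. The following is a minimal, hypothetical sketch of that idea in Python, assuming the sentence-transformers library and the LaBSE model; the function name, example sentences, and 0.7 threshold are illustrative assumptions, not details drawn from any of the listed papers.

```python
# Hypothetical sketch: filter noisy pairs from a parallel corpus by cosine
# similarity of multilingual sentence embeddings. Assumes the
# `sentence-transformers` package and the LaBSE model; the 0.7 threshold
# is an illustrative choice, not a recommended value.
from sentence_transformers import SentenceTransformer
import numpy as np

def filter_parallel_pairs(src_sents, tgt_sents, threshold=0.7):
    """Keep only (source, target) pairs whose embeddings are similar."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    # With normalized embeddings, the row-wise dot product equals
    # cosine similarity.
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)
    sims = np.sum(src_emb * tgt_emb, axis=1)
    return [
        (s, t, float(sim))
        for s, t, sim in zip(src_sents, tgt_sents, sims)
        if sim >= threshold
    ]

if __name__ == "__main__":
    src = ["The cat sits on the mat.", "Completely unrelated sentence."]
    tgt = ["Die Katze sitzt auf der Matte.", "Der Zug fährt um acht Uhr ab."]
    for s, t, sim in filter_parallel_pairs(src, tgt):
        print(f"{sim:.2f}\t{s}\t{t}")
```

In practice, the similarity threshold is corpus-dependent and is usually tuned on a small held-out set of pairs with known quality labels rather than fixed in advance.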
Papers
The VolcTrans System for WMT22 Multilingual Machine Translation Task
Xian Qian, Kai Hu, Jiaqiang Wang, Yifeng Liu, Xingyuan Pan, Jun Cao, Mingxuan Wang
Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS
Chunyu Qiang, Jianhua Tao, Ruibo Fu, Zhengqi Wen, Jiangyan Yi, Tao Wang, Shiming Wang
Language Agnostic Multilingual Information Retrieval with Contrastive Learning
Xiyang Hu, Xinchi Chen, Peng Qi, Deguang Kong, Kunlun Liu, William Yang Wang, Zhiheng Huang
Improved Data Augmentation for Translation Suggestion
Hongxiao Zhang, Siyu Lai, Songming Zhang, Hui Huang, Yufeng Chen, Jinan Xu, Jian Liu