Bilingual Data

Bilingual data research develops and uses datasets that pair text or speech in two languages, with the goal of improving multilingual natural language processing (NLP) models. Current work emphasizes building high-quality bilingual corpora for specific domains (e.g., finance, medicine, general knowledge) and often applies large language models (LLMs) to tasks such as translation, question answering, and safety detection. This work is central to advancing multilingual NLP, particularly for low-resource languages, and it bears directly on cross-cultural communication and access to information.

Papers