Parallel Data

Parallel data, consisting of paired texts (or other paired data modalities) in multiple languages, is crucial for training effective machine translation models and other cross-lingual natural language processing systems. Current research focuses both on improving model efficiency, through techniques such as temporal parallelism in spiking neural networks and memory deduplication in tensor parallelism, and on making better use of scarce parallel data, for example by generating pseudo-parallel data and by augmenting existing datasets with carefully selected in-domain sentences. The availability and effective use of parallel data strongly influence the performance of multilingual models, enabling advances in machine translation, text detoxification, and other applications that require cross-lingual understanding.
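
The augmentation idea mentioned above can be illustrated with a minimal sketch: given a small in-domain seed corpus, candidate sentence pairs from a general parallel corpus are ranked by how similar their source side is to the seed, and only the closest pairs are kept. The bag-of-words scoring and the function names (`select_in_domain`, `cosine`) are illustrative assumptions, not any specific paper's method; real systems typically use stronger signals such as language-model cross-entropy difference or sentence embeddings.

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[tok] * b[tok] for tok in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_in_domain(candidates, seed_sentences, top_k=2):
    """Keep the top_k candidate (source, target) pairs whose source side
    is most similar to an in-domain seed corpus (hypothetical helper)."""
    # Build a single bag-of-words profile of the in-domain seed corpus.
    domain_profile = Counter(
        tok for sent in seed_sentences for tok in sent.lower().split()
    )
    # Score every candidate pair by source-side similarity to the profile.
    scored = [
        (cosine(Counter(src.lower().split()), domain_profile), (src, tgt))
        for src, tgt in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:top_k]]


# Toy usage: pick medical-domain pairs out of a mixed general corpus.
seed = [
    "the patient received a dose of the vaccine",
    "clinical trials confirmed the treatment was safe",
]
mixed = [
    ("the stock market fell sharply today",
     "la bourse a fortement chuté aujourd'hui"),
    ("the patient was given a second dose",
     "le patient a reçu une deuxième dose"),
    ("the trial showed the drug was safe",
     "l'essai a montré que le médicament était sûr"),
]
print(select_in_domain(mixed, seed, top_k=2))
```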

Papers