Multimodal Machine Translation

Multimodal machine translation (MMT) aims to improve the accuracy and fluency of machine translation by supplementing the source text with visual information, such as images or video. Current research focuses on more efficient training methods, on mitigating data scarcity through zero-shot learning and improved dataset design (e.g., datasets that deliberately include ambiguous text), and on model architectures, including Transformer-based approaches and those that build on pre-trained vision-language models. These advances could improve translation quality most where the source text is ambiguous and visual context is needed to disambiguate it, with applications in cross-lingual communication and information access.

Papers