Multimodal Machine Translation
Multimodal machine translation (MMT) aims to improve the accuracy and fluency of machine translation by incorporating visual information, such as images or videos, alongside the textual input. Current research focuses on developing more efficient training methods, addressing data scarcity through zero-shot learning and improved dataset design (e.g., deliberately incorporating ambiguity), and exploring various model architectures, including Transformer-based approaches and those that leverage pre-trained vision-language models. These advances hold significant potential for improving translation quality, particularly when the source text is ambiguous and visual context is needed for disambiguation, with applications in cross-lingual communication and information access.
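To make the core idea concrete, below is a minimal, illustrative sketch of one common MMT design: a pre-extracted image feature is projected into the text embedding space and prepended to the source tokens before a standard encoder-decoder Transformer. All names (SimpleMultimodalTranslator, img_proj, the feature dimension of 2048) are hypothetical choices for this sketch and are not taken from the papers listed here; real systems differ in how and where visual features are fused.

```python
import torch
import torch.nn as nn


class SimpleMultimodalTranslator(nn.Module):
    """Toy MMT sketch: prepend a projected image feature to the source
    token embeddings, then run a standard encoder-decoder Transformer.
    Assumes image features were pre-extracted by a frozen vision model."""

    def __init__(self, src_vocab, tgt_vocab, d_model=256, img_feat_dim=2048):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Project the image feature into the token-embedding space so it can
        # act as an extra "visual token" in the encoder input.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, img_feat, tgt_ids):
        src = self.src_embed(src_ids)                # (B, S, d_model)
        img = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, d_model)
        enc_in = torch.cat([img, src], dim=1)        # visual token + text tokens
        tgt = self.tgt_embed(tgt_ids)                # (B, T, d_model)
        # Causal mask so each target position only attends to earlier ones.
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        dec_out = self.transformer(enc_in, tgt, tgt_mask=causal)
        return self.out(dec_out)                     # (B, T, tgt_vocab)


# Smoke test with random data.
model = SimpleMultimodalTranslator(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))    # batch of 2 source sentences
img = torch.randn(2, 2048)              # one image feature per sentence
tgt = torch.randint(0, 1000, (2, 5))    # shifted target tokens
logits = model(src, img, tgt)
print(logits.shape)  # torch.Size([2, 5, 1000])
```

The single prepended visual token is only one fusion strategy; other approaches in the literature instead use cross-attention over regional image features or adapt pre-trained vision-language encoders, as the papers below explore.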
Papers
Detecting Concrete Visual Tokens for Multimodal Machine Translation
Braeden Bowen, Vipin Vijayan, Scott Grigsby, Timothy Anderson, Jeremy Gwinnup
Adding Multimodal Capabilities to a Text-only Translation Model
Vipin Vijayan, Braeden Bowen, Scott Grigsby, Timothy Anderson, Jeremy Gwinnup
The Case for Evaluating Multimodal Translation Models on Text Datasets
Vipin Vijayan, Braeden Bowen, Scott Grigsby, Timothy Anderson, Jeremy Gwinnup