End-to-End Speech Translation
End-to-end speech translation (E2E-ST) directly maps speech in one language to text in another, avoiding the error propagation and added latency of cascaded pipelines that chain separate speech recognition and machine translation systems. Current research focuses heavily on leveraging large language models (LLMs) and neural transducers to improve translation quality and efficiency, often incorporating multi-task learning, data augmentation, and improved segmentation methods to handle long audio streams and code-switching. These advances promise faster, more accurate, and more robust speech translation systems, with applications ranging from real-time communication tools to improved accessibility for multilingual users. The field is also actively exploring evaluation metrics that move beyond text-based assessments.
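To make the multi-task training idea concrete, below is a minimal PyTorch sketch of an E2E-ST model with a shared speech encoder and an auxiliary transcription objective. Everything here is an illustrative assumption rather than the method of any paper listed on this page: the module names, the model sizes, the use of CTC for both heads, and the 0.3 auxiliary weight are all placeholders chosen for brevity.

```python
# Minimal multi-task E2E-ST sketch (all names, sizes, and the CTC formulation
# are illustrative assumptions, not taken from the papers listed below).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyE2EST(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab=1000):
        super().__init__()
        # Shared acoustic encoder: log-mel frames -> contextual states.
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Two output heads: target-language tokens (ST, the primary task)
        # and source-language transcript tokens (ASR, the auxiliary task).
        self.st_head = nn.Linear(d_model, vocab)
        self.asr_head = nn.Linear(d_model, vocab)

    def forward(self, mels):                      # mels: (B, T, n_mels)
        h = self.encoder(self.proj(mels))         # (B, T, d_model)
        return self.st_head(h), self.asr_head(h)  # two (B, T, vocab) logit tensors

model = TinyE2EST()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

mels = torch.randn(2, 100, 80)            # dummy batch: 2 utterances, 100 frames
st_tgt = torch.randint(1, 1000, (2, 12))  # dummy translation token ids
asr_tgt = torch.randint(1, 1000, (2, 15)) # dummy transcript token ids
in_lens = torch.full((2,), 100)
st_logits, asr_logits = model(mels)

# CTC expects (T, B, vocab) log-probs; the ASR term supplies auxiliary
# supervision that regularizes the shared encoder.
st_loss = ctc(F.log_softmax(st_logits, -1).transpose(0, 1),
              st_tgt, in_lens, torch.full((2,), 12))
asr_loss = ctc(F.log_softmax(asr_logits, -1).transpose(0, 1),
               asr_tgt, in_lens, torch.full((2,), 15))
loss = st_loss + 0.3 * asr_loss  # multi-task loss: ST primary, ASR auxiliary
loss.backward()
```

In practice the translation head is usually an attention-based decoder rather than a CTC head, and the auxiliary weight is tuned per dataset; the sketch only shows how one shared encoder can serve both objectives in a single backward pass.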
Papers
Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning
Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, Wei-Qiang Zhang
Decoupled Non-parametric Knowledge Distillation for End-to-end Speech Translation
Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, Zhen Li