Multimodal Teacher

Multimodal teacher models are trained on several data sources (e.g., images, text, audio) and then used to train student models that are more robust and accurate, even when the student sees only a single modality at inference time. Current research focuses on improving knowledge distillation techniques, exploring architectures such as transformers and autoencoders, and addressing challenges such as modality mismatch and limited training data with methods such as disentanglement learning and multi-teacher approaches. The area matters because it yields high-performing single-modality models without the computational cost of processing every modality at inference, with applications ranging from video captioning and event detection to cross-domain adaptation and machine translation.
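As a rough illustration of the knowledge distillation step mentioned above, the sketch below computes a common temperature-scaled distillation loss for one example. It is a minimal sketch, not the method of any particular paper: the function names, the logits, and the `temperature` and `alpha` values are all hypothetical, and the teacher logits are simply assumed to come from a model that saw several modalities while the student saw one.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    soft term: KL(teacher || student) on temperature-softened distributions,
               scaled by temperature**2 (a common convention).
    hard term: cross-entropy of the student against the true label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Hypothetical values: the teacher is assumed to have seen e.g. video + audio,
# the student only a single modality.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 0.8, -0.5]
loss = distillation_loss(student_logits, teacher_logits, label=0)
print(f"distillation loss: {loss:.4f}")
```

Setting `alpha` closer to 1 makes the student follow the teacher's softened predictions more closely, while values closer to 0 emphasize the ground-truth label; the right balance is task-dependent.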

Papers