Visual Modality
Visual modality research focuses on understanding and leveraging visual information in conjunction with other modalities (such as text and audio), primarily to improve the accuracy and robustness of machine learning models. Current work emphasizes multimodal fusion techniques, often built on transformer-based architectures and contrastive learning, to integrate visual features with other data types for applications such as image captioning, semantic segmentation, and machine translation. The field matters because it enables AI systems that can interpret complex scenes and interactions, with applications ranging from robotics and augmented reality to improved accessibility and content creation.
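To make the idea of contrastive multimodal fusion concrete, the sketch below shows a minimal CLIP-style alignment loss between paired visual and text embeddings. It is an illustrative example only: the function name, dimensions, and temperature value are assumptions for this sketch and are not taken from the papers listed below.

```python
# Illustrative sketch (assumption, not from the listed papers): symmetric
# contrastive alignment of paired image and text embeddings, CLIP-style.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.T / temperature

    # Matching image/text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 paired embeddings with dimension 256.
image_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
print(contrastive_alignment_loss(image_emb, text_emb).item())
```

In practice the two embedding batches would come from separate image and text encoders (e.g., a vision transformer and a text transformer), and the loss pulls matching pairs together while pushing mismatched pairs apart.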
Papers
Speech inpainting: Context-based speech synthesis guided by video
Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen
PV2TEA: Patching Visual Modality to Textual-Established Information Extraction
Hejie Cui, Rongmei Lin, Nasser Zalmout, Chenwei Zhang, Jingbo Shang, Carl Yang, Xian Li