Multimodal Encoder
Multimodal encoders are computational models that process and integrate information from multiple data sources, such as images, text, audio, and sensor readings, into a single, unified representation. Current research focuses on improving how these modalities are aligned and fused, often using transformer-based architectures and contrastive learning to produce robust representations that transfer well to downstream tasks. This work matters for its potential to improve applications across diverse fields, including robotics, 3D printing, medical image analysis, and natural language processing, by enabling more sophisticated, context-aware systems.
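To make the contrastive-alignment idea concrete, below is a minimal PyTorch sketch of how two modality-specific feature streams can be projected into a shared embedding space and trained with a symmetric InfoNCE objective, in the style popularized by CLIP. All dimensions, names, and the temperature initialization are illustrative assumptions, not taken from any specific paper in this area.

```python
# Minimal sketch of contrastive alignment between two modalities.
# Assumptions: image features of dim 2048 (e.g. a CNN/ViT backbone),
# text features of dim 768 (e.g. a transformer encoder), shared dim 512.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalEncoder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        # Projection heads map each modality into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Learnable temperature (stored as a log), a common CLIP-style choice.
        self.log_temp = nn.Parameter(torch.tensor(2.659))  # ~ log(1 / 0.07)

    def forward(self, img_feats, txt_feats):
        # L2-normalize so the dot product below is cosine similarity.
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img, txt

def contrastive_loss(img, txt, log_temp):
    # Pairwise similarities between every image and every text in the batch.
    logits = img @ txt.t() * log_temp.exp()
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric InfoNCE: matched (image, text) pairs lie on the diagonal.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random tensors standing in for backbone outputs.
model = MultimodalEncoder()
img_feats = torch.randn(8, 2048)
txt_feats = torch.randn(8, 768)
img, txt = model(img_feats, txt_feats)
loss = contrastive_loss(img, txt, model.log_temp)
loss.backward()
```

The key design choice is that each modality keeps its own encoder while the loss pulls matched pairs together and pushes mismatched pairs apart in the shared space; fusion-based approaches instead combine the modalities inside a joint transformer before producing a representation.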