Multimodal Understanding

Multimodal understanding focuses on enabling machines to comprehend and integrate information from multiple sources such as text, images, audio, and video, mirroring human cognitive abilities. Current research emphasizes developing multimodal large language models (MLLMs) built on architectures such as transformers and diffusion models, often incorporating techniques like instruction tuning and knowledge fusion to improve performance across diverse tasks. This field is considered important for progress toward artificial general intelligence and has significant implications for applications ranging from robotics and human-computer interaction to scientific discovery and creative content generation.
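To make the fusion idea above concrete, here is a minimal, purely illustrative sketch of late fusion: each modality's embedding is projected into a shared space and the results are concatenated into a joint representation. All names, dimensions, and weights here are hypothetical toy values, not taken from any specific MLLM; real systems learn these projections and typically use attention-based fusion instead.

```python
import random

random.seed(0)

def linear(vec, weights):
    # Apply a simple linear projection; `weights` is a list of rows.
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def fuse(text_emb, image_emb, w_text, w_image):
    # Toy late fusion: project each modality into a shared space,
    # then concatenate the projected vectors.
    return linear(text_emb, w_text) + linear(image_emb, w_image)

# Hypothetical tiny embeddings: text dim 3, image dim 4, shared dim 2.
text_emb = [0.1, 0.2, 0.3]
image_emb = [0.4, 0.1, 0.0, 0.2]
w_text = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_image = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]

fused = fuse(text_emb, image_emb, w_text, w_image)
print(len(fused))  # joint representation has 2 + 2 = 4 dimensions
```

A downstream head (e.g. a classifier or a language-model decoder) would then consume `fused`; the key design point is that both modalities end up in a single vector space the rest of the model can operate on.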

Papers

March 8, 2024