Multimodal Language Model

Multimodal language models (MLLMs) aim to integrate and process information from multiple modalities, such as text, images, and video, to achieve a more comprehensive understanding of the world. Current research focuses on improving MLLM performance through techniques like fine-grained reward models, knowledge distillation to create smaller, more efficient models, and data augmentation strategies to address data scarcity and biases. These advancements are significant because they enhance the reliability and applicability of MLLMs across diverse fields, including medical diagnosis, video summarization, and autonomous driving, by enabling more accurate and nuanced interpretations of complex multimodal data.

Papers

August 17, 2024

Measuring Agreeableness Bias in Multimodal Models
Jaehyuk Lim, Bruce W. Lee
Multimodal Model Prior Knowledge Multiple Choice Question Multimodal Language Model Visual Sycophancy

August 1, 2024

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, Ranjay Krishna
Multimodal LLM Dense Correspondence Multimodal Language Model 3D Object Detection Benchmark Object Correspondence

July 31, 2024

Learning Video Context as Interleaved Multimodal Sequences
Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou
Video Understanding Multimodal Language Model Video Context Multimodal Sequence Interleaved Multimodal

July 29, 2024

ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2
Wenjun Huang, Jianguo Hu
Multimodal Large Language Model Multimodal Model Multi Modal Large Language Model Mamba Based Multimodal Language Model Mamba Based Model Mamba Language Model

July 27, 2024

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
Ruiyi Zhang, Yufan Zhou, Jian Chen, Jiuxiang Gu, Changyou Chen, Tong Sun
Multimodal Large Language Model Critique Ability Text Encoder Multimodal Language Model Text Rich Image Text Comprehension

July 19, 2024

On Pre-training of Multimodal Language Models Customized for Chart Understanding
Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Lu Yuan, Leonid Sigal
Multimodal Large Language Model Multimodal Language Model Chart Comprehension Chart Image Language Reasoning Scientific Chart

July 7, 2024

Multimodal Language Models for Domain-Specific Procedural Video Summarization
Nafisa Hussain
Multimodal Model Video Summarization Multimodal Language Model Long Video Long Form Video Video Assistant

July 6, 2024

OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
Tiancheng Zhao, Qianqian Zhang, Kyusong Lee, Peng Liu, Lu Zhang, Chunxin Fang, Jiajia Liao, Kelei Jiang, Yibo Ma, Ruochen Xu
Video Understanding Long Context Complete Recipe Multimodal Pre Visual Understanding Video Understanding Task Multimodal Language Model Visual Temporal

July 1, 2024

CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li
Cross Domain Autonomous Agent Multimodal Language Model Agent Benchmark Rule Abstraction

June 22, 2024

MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
Xirui Li, Hengguang Zhou, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Cho-Jui Hsieh
Multimodal Benchmark Multimodal Language Model Network Sensitivity Multimodal Query Cognitive Distortion

June 18, 2024

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada
Large Language Model Language Model Speech Analysis Non Negative Textual Response Multimodal Language Model Speech Segment Parallel Generation Response Latency

June 17, 2024

WeatherQA: Can Multimodal Language Models Reason about Severe Weather?
Chengqian Ma, Zhanxiang Hua, Alexandra Anderson-Frey, Vikram Iyer, Xin Liu, Lianhui Qin
Multimodal Language Model Extreme Event Weather Benchmark Severe Weather

June 13, 2024

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna
Thought Reasoning Multimodal Language Model Creative Sketching Visual Reasoning Task Perspective Sketch

June 12, 2024

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
Gameplay Video Multimodal Language Model Video Understanding Benchmark AI Reasoning Multimodal Video Understanding General World Model

June 10, 2024

TRINS: Towards Multimodal Language Models that Can Read
Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun
Multimodal Large Language Model Vision Language MAESTRO Dataset Multimodal Language Model Text Rich Image

June 7, 2024

DualTime: A Dual-Adapter Multimodal Language Model for Time Series Representation
Weiqi Zhang, Jiexia Ye, Ziyue Li, Jia Li, Fugee Tsung
Language Model Time Series Time Matter Multimodal Data Multimodal Language Model Time Series Representation

June 4, 2024

Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing
Viet Anh Trinh, Rosy Southwell, Yiwen Guan, Xinlu He, Zhiyong Wang, Jacob Whitehill
Large Language Model Speech Recognition Multi Modal Model Multimodal Transformer Multimodal Language Model Speech Discrete Token

May 30, 2024

LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild
Zhiqiang Wang, Dejia Xu, Rana Muhammad Shahroz Khan, Yanbin Lin, Zhiwen Fan, Xingquan Zhu
Multimodal Language Model Multi Modal Language Model Wild Image Image Geolocation

May 24, 2024

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro
Vision Language Vision Model Large Language Mamba Based Multimodal Language Model Visual Instruction Tuning Rationale Alignment MeTeoR Track

May 3, 2024

Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay
Human Evaluation Much Progress Automatic Evaluation Multimodal Language Model Multi Modal Dialogue Evaluation Suite