Multimodal Model

Multimodal models integrate information from multiple sources like text, images, audio, and video to achieve a more comprehensive understanding than unimodal approaches. Current research focuses on improving model interpretability, addressing biases, enhancing robustness against adversarial attacks and missing data, and developing efficient architectures like transformers and state-space models for various tasks including image captioning, question answering, and sentiment analysis. These advancements are significant for applications ranging from healthcare and robotics to more general-purpose AI systems, driving progress in both fundamental understanding and practical deployment of AI.

Papers

October 29, 2024

Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data
Badr AlKhamissi, Yingtian Tang, Abdülkadir Gökce, Johannes Mehrer, Martin Schrimpf
Large Language Model Vision Language Model Training Data Development Activity Multimodal Model Human Like Analysis by Synthesis Cognitive Development

October 26, 2024

Enhancing Lie Detection Accuracy: A Comparative Study of Classic ML, CNN, and GCN Models using Audio-Visual Features
Abdelrahman Abdelwahab, Akshaj Vishnubhatla, Ayaan Vaswani, Advait Bharathulwar, Arnav Kommaraju
Audio Visual Multimodal Model CNN Network Micro Expression Classical Machine Learning Deception Detection GCN Model Lie Detection

October 25, 2024

October 24, 2024

October 21, 2024

October 20, 2024

Modality-Fair Preference Optimization for Trustworthy MLLM Alignment
Songtao Jiang, Yan Zhang, Ruizhe Chen, Yeying Jin, Zuozhu Liu
Multimodal Model Direct Preference Optimization Preference Optimization Fine Grained Attribute

October 17, 2024

October 16, 2024

October 15, 2024

Unveiling the Mystery of Visual Attributes of Concrete and Abstract Concepts: Variability, Nearest Neighbors, and Challenging Categories
Tarun Tater, Sabine Schulte im Walde, Diego Frassinelli
Visual Representation Multimodal Model Nearest Neighbor Concept Identification Visual Feature Diverse Image Visual Attribute Abstract Text

October 13, 2024

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, Zhizheng Wu, Yiping Chen, Dahua Lin, Conghui He, Weijia Li
Synthetic Data Multimodal Data Multimodal Model Generated Content Synthetic Text

October 10, 2024

ElasticTok: Adaptive Tokenization for Image and Video
Wilson Yan, Matei Zaharia, Volodymyr Mnih, Pieter Abbeel, Aleksandra Faust, Hao Liu
Multimodal Model Source Video Elastic Net Adaptive Tokenization

Multimodal Model

Papers

Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data

Enhancing Lie Detection Accuracy: A Comparative Study of Classic ML, CNN, and GCN Models using Audio-Visual Features

Turn-by-Turn Indoor Navigation for the Visually Impaired

A Multimodal Approach For Endoscopic VCE Image Classification Using BiomedCLIP-PubMedBERT

OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques

A Survey of Multimodal Sarcasm Detection

DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding

Promoting cross-modal representations to improve multimodal foundation models for physiological signals

Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models

Multimodal Learning for Embryo Viability Prediction in Clinical IVF

Modality-Fair Preference Optimization for Trustworthy MLLM Alignment

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Harnessing Webpage UIs for Text-Rich Visual Understanding

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

Unveiling the Mystery of Visual Attributes of Concrete and Abstract Concepts: Variability, Nearest Neighbors, and Challenging Categories

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

ElasticTok: Adaptive Tokenization for Image and Video