Multimodal Large Language Model
Multimodal large language models (MLLMs) integrate multiple data modalities, such as text, images, and audio, to enhance understanding and reasoning capabilities beyond those of unimodal models. Current research emphasizes improving MLLM performance through refined architectures (e.g., incorporating visual grounding, chain-of-thought prompting), mitigating biases and hallucinations, and developing robust evaluation benchmarks that assess various aspects of multimodal understanding, including active perception and complex reasoning tasks. This work is significant because it pushes the boundaries of AI capabilities, leading to advancements in diverse applications like medical diagnosis, financial analysis, and robotic manipulation.
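To make the "integrate multiple data modalities" idea concrete, below is a minimal, illustrative PyTorch sketch of the common projector-based MLLM design, in which image features from a vision encoder are projected into the language model's embedding space and concatenated with text token embeddings. All module names, sizes, and the simplified stand-in encoder here are assumptions for illustration, not the implementation of any specific paper listed on this page.

```python
# Minimal sketch of a projector-based MLLM: vision patch features are mapped
# into the language model's embedding space and prepended to the text tokens.
# Sizes and modules are illustrative stand-ins, not a real pretrained model.
import torch
import torch.nn as nn


class ToyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, d_vision=768, n_layers=4):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT) that emits
        # one d_vision-dimensional feature per image patch.
        self.vision_encoder = nn.Linear(3 * 16 * 16, d_vision)
        # Projector mapping vision features into the LLM embedding space.
        self.projector = nn.Linear(d_vision, d_model)
        # Stand-in for the language-model backbone.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (B, n_patches, 3*16*16); text_ids: (B, seq_len)
        vision_feats = self.vision_encoder(image_patches)       # (B, P, d_vision)
        vision_tokens = self.projector(vision_feats)            # (B, P, d_model)
        text_tokens = self.token_embed(text_ids)                 # (B, T, d_model)
        # Fuse modalities by concatenating along the sequence axis.
        fused = torch.cat([vision_tokens, text_tokens], dim=1)   # (B, P+T, d_model)
        hidden = self.backbone(fused)
        # Predict next-token logits over the text positions only.
        return self.lm_head(hidden[:, vision_tokens.size(1):])


if __name__ == "__main__":
    model = ToyMultimodalLM()
    patches = torch.randn(2, 196, 3 * 16 * 16)   # 2 images, 14x14 patches each
    text = torch.randint(0, 32000, (2, 10))      # 2 prompts, 10 tokens each
    print(model(patches, text).shape)            # torch.Size([2, 10, 32000])
```

In practice, MLLMs of this style keep the vision encoder frozen and train the projector (and optionally the language model) on image-text pairs; other architectures instead use cross-attention or interleaved modality tokens.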
497 papers
Papers
February 27, 2025
A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs
Protecting multimodal large language models against misleading visualizations
AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs
When Continue Learning Meets Multimodal Large Language Model: A Survey
Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack