Multi-modal Large Language Model
Multi-modal large language models (MLLMs) integrate textual and visual information to perform complex reasoning tasks, aiming to bridge the gap between current AI capabilities and human-level intelligence. Current research focuses on addressing MLLM limitations such as hallucinations and biases, particularly in low-level visual perception and abstract reasoning, through improved model architectures, benchmark development, and training techniques like chain-of-thought prompting and instruction fine-tuning. These advancements are crucial for enhancing the reliability and trustworthiness of MLLMs across diverse applications, from healthcare diagnostics to educational tools and scientific problem-solving.
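To make the mention of chain-of-thought prompting concrete, below is a minimal sketch of how one might elicit step-by-step visual reasoning from a vision-capable chat model via the OpenAI Python SDK. The model name, prompt wording, and image URL are illustrative assumptions, not details taken from the papers listed here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Chain-of-thought prompting for multi-modal reasoning: the text part asks the
# model to reason step by step about the attached image before answering.
response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable chat model could be used
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Look at the diagram and solve the problem it shows. "
                        "Reason step by step, then state the final answer on its own line."
                    ),
                },
                {
                    "type": "image_url",
                    # hypothetical image of a K-12 geometry problem
                    "image_url": {"url": "https://example.com/geometry_problem.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The same prompt pattern (image plus an explicit "reason step by step" instruction) is what benchmarks of multi-modal scientific and mathematical reasoning typically evaluate, with the intermediate reasoning inspected for hallucinated visual details as well as the final answer.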
Papers
VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning
Zhihuan Jiang, Zhen Yang, Jinhao Chen, Zhengxiao Du, Weihan Wang, Bin Xu, Yuxiao Dong, Jie Tang
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Yuxiao Dong, Jie Tang