Large Vision-Language Models
Large vision-language models (VLMs) integrate visual and textual information, enabling machines to understand and reason about images and text jointly. Current research focuses on improving VLM performance in challenging scenarios, such as handling occluded objects in images, and on extending capabilities to longer videos and more complex tasks like chart comprehension. This involves developing novel architectures, efficient fine-tuning techniques, and large-scale datasets to address limitations in existing models. Advances in VLMs have significant implications for applications including robotics, image retrieval, and question-answering systems.
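As a rough illustration of the image-plus-text question answering these models support, the sketch below runs a single multimodal query through an open VLM. It assumes the Hugging Face transformers library and the llava-hf/llava-1.5-7b-hf checkpoint; the model id, image path, and question are illustrative placeholders, not drawn from the papers listed here.

```python
# A minimal sketch of vision-language inference, assuming the Hugging Face
# `transformers` library and the public `llava-hf/llava-1.5-7b-hf` checkpoint;
# the image path and question are hypothetical.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; any LLaVA-style VLM works similarly
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# One image plus a text question, fused into a single multimodal prompt.
image = Image.open("example_chart.png")  # hypothetical local image
prompt = "USER: <image>\nWhat does this chart show? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```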
Papers
Improving Multimodal LLMs Ability In Geometry Problem Solving, Reasoning, And Multistep Scoring
Avinash Anand, Raj Jaiswal, Abhishek Dharmadhikari, Atharva Marathe, Harsh Parimal Popat, Harshil Mital, Kritarth Prasad, Rajiv Ratn Shah, Roger Zimmermann
LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models
Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis