Visual Understanding
Visual understanding research aims to enable computers to interpret and reason about images and videos as humans do, covering tasks from object recognition and scene description to complex visual reasoning. Current work relies heavily on large language and vision models (LLVMs), which typically pair a vision transformer with a language model and use techniques such as chain-of-thought prompting and visual instruction tuning to improve performance. The field is central to advancing artificial intelligence, with applications ranging from robotics and autonomous driving to medical image analysis and accessibility tools for visually impaired people.
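Chain-of-thought prompting for visual tasks can be sketched as a small prompt-construction helper. The chat-message schema below is an assumption modeled on common multimodal chat APIs (image plus text content parts in a user turn), not the interface of any specific system, and the URL is a placeholder:

```python
# Illustrative sketch of a chain-of-thought visual prompt.
# The message schema is an assumption based on typical multimodal chat
# APIs; field names and the image URL are placeholders.

def build_cot_visual_prompt(image_url: str, question: str) -> list[dict]:
    """Assemble a multimodal chat turn that asks the model to reason
    step by step about an image before giving its final answer."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {
                    "type": "text",
                    "text": (
                        f"{question}\n"
                        "Think step by step: first describe the relevant "
                        "objects, then reason about their relationships, "
                        "and only then state the final answer."
                    ),
                },
            ],
        }
    ]

messages = build_cot_visual_prompt(
    "https://example.com/kitchen.jpg",  # placeholder image URL
    "How many mugs are on the counter?",
)
print(messages[0]["content"][1]["text"])
```

The key idea is that the text part explicitly asks for intermediate reasoning steps; visual instruction tuning, by contrast, bakes this behavior into the model by fine-tuning on image-instruction-response triples.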
Papers
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
Talfan Evans, Shreya Pathak, Hamza Merzic, Jonathan Schwarz, Ryutaro Tanno, Olivier J. Hénaff
GlitchBench: Can large multimodal models detect video game glitches?
Mohammad Reza Taesiri, Tianjun Feng, Anh Nguyen, Cor-Paul Bezemer