Visual Encoder
Visual encoders are fundamental components of large vision-language models (LVLMs), tasked with extracting meaningful visual features from images for tasks such as object recognition, scene understanding, and multimodal reasoning. Current research focuses on improving encoder efficiency (e.g., through token compression and adaptive token reduction), mitigating issues such as object hallucination, and enhancing robustness across diverse visual inputs (e.g., high-resolution images, multi-camera data, and varied channel configurations). These advances are crucial for deploying LVLMs in resource-constrained environments and for extending their applicability to real-world scenarios such as autonomous driving and robotic manipulation. More efficient and robust visual encoders are, in turn, driving progress across computer vision, natural language processing, and robotics.
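To make the token-compression idea above concrete, here is a minimal sketch of one common variant: scoring each patch token against a global query (e.g., a [CLS] embedding) and keeping only the top-k tokens before they are handed to the language model. The function and parameter names (compress_tokens, keep_ratio) are illustrative assumptions, not drawn from any of the papers listed below.

```python
# Hypothetical sketch of attention-based visual token compression;
# names and the 25% keep ratio are illustrative, not from a specific paper.
import torch

def compress_tokens(tokens: torch.Tensor,
                    cls_query: torch.Tensor,
                    keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the top-k visual tokens by attention score.

    tokens:    (batch, num_tokens, dim) patch embeddings from the encoder
    cls_query: (batch, dim) global query vector, e.g. the [CLS] token
    """
    batch, num_tokens, dim = tokens.shape
    k = max(1, int(num_tokens * keep_ratio))
    # Scaled dot-product score between the global query and every patch token.
    scores = torch.einsum("bd,bnd->bn", cls_query, tokens) / dim ** 0.5
    top_idx = scores.topk(k, dim=-1).indices  # (batch, k)
    # Gather the selected tokens; the sequence the LLM sees shrinks to k tokens.
    return tokens.gather(1, top_idx.unsqueeze(-1).expand(-1, -1, dim))

# Example: compress 576 patch tokens (a 24x24 grid) down to 144.
feats = torch.randn(2, 576, 1024)
cls = torch.randn(2, 1024)
print(compress_tokens(feats, cls).shape)  # torch.Size([2, 144, 1024])
```

Adaptive token-reduction methods differ mainly in how the score and budget are chosen (learned predictors, per-image budgets, token merging instead of dropping), but the core pattern of ranking and subselecting tokens is the same.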
Papers
StereoNavNet: Learning to Navigate using Stereo Cameras with Auxiliary Occupancy Voxels
Hongyu Li, Taskin Padir, Huaizu Jiang
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang