Visual Encoder
Visual encoders are fundamental components of large vision-language models (LVLMs), responsible for extracting meaningful information from images to support tasks such as object recognition, scene understanding, and multimodal reasoning. Current research focuses on improving encoder efficiency (e.g., through token compression and adaptive token reduction), mitigating issues such as object hallucination, and enhancing robustness across diverse visual inputs (e.g., high-resolution images, multi-camera data, and varied channel configurations). These advances are crucial for deploying LVLMs in resource-constrained environments and for extending them to real-world applications such as autonomous driving and robotic manipulation. More efficient and robust visual encoders are, in turn, driving progress across computer vision, natural language processing, and robotics.
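As a rough illustration of the token-compression idea mentioned above, the sketch below merges adjacent visual tokens by average pooling before they would be passed to a language model, reducing the sequence length the LLM must process. It is a minimal example under assumed conventions; the names (`VisualTokenCompressor`, `compress_ratio`) are illustrative and not taken from any of the papers listed here.

```python
# Minimal sketch of visual-token compression (illustrative, not from any listed paper):
# patch tokens from a visual encoder are pooled in groups before reaching the LLM.
import torch
import torch.nn as nn


class VisualTokenCompressor(nn.Module):
    def __init__(self, compress_ratio: int = 4):
        super().__init__()
        # Merge every `compress_ratio` adjacent tokens into one by averaging.
        self.pool = nn.AvgPool1d(kernel_size=compress_ratio, stride=compress_ratio)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> (batch, num_tokens // compress_ratio, dim)
        x = tokens.transpose(1, 2)   # (batch, dim, num_tokens) for 1-D pooling
        x = self.pool(x)
        return x.transpose(1, 2)


if __name__ == "__main__":
    # e.g., 576 patch tokens (a 24x24 grid) with 1024-dim features from a ViT-style encoder
    patch_tokens = torch.randn(2, 576, 1024)
    compressor = VisualTokenCompressor(compress_ratio=4)
    print(compressor(patch_tokens).shape)  # torch.Size([2, 144, 1024])
```

Real systems typically use learned merging or attention-based resamplers rather than plain pooling, but the effect on LLM compute is the same: fewer visual tokens per image.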
Papers
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo
Improving Multi-modal Large Language Model through Boosting Vision Capabilities
Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, Saining Xie
From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
Yuying Shang, Xinyi Zeng, Yutao Zhu, Xiao Yang, Zhengwei Fang, Jingyuan Zhang, Jiawei Chen, Zinan Liu, Yu Tian