Multi-Image
Multi-image processing focuses on analyzing and integrating information from multiple images to achieve a more comprehensive understanding than is possible with a single image alone. Current research emphasizes developing large multimodal models (LMMs) capable of handling diverse multi-image tasks, including relational association, scene understanding, and object co-segmentation, often employing transformer architectures and contrastive learning techniques to improve performance. These advances are important for applications such as visual question answering, medical image analysis, and robust image generation, where they enable more nuanced and accurate interpretation of complex visual data. Benchmark datasets are being developed to rigorously evaluate the capabilities of these models and to identify areas for future improvement.
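To make the contrastive-learning component mentioned above concrete, the sketch below shows a symmetric InfoNCE-style loss that pulls embeddings of related images (e.g., two views of the same scene) together while pushing apart embeddings of unrelated images in the batch. This is an illustrative sketch, not code from any of the listed papers; the function name, embedding dimension, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over two batches of image embeddings.

    emb_a, emb_b: (batch, dim) embeddings of paired images; row i of
    emb_a and row i of emb_b come from related images, and every other
    row in the batch serves as a negative.
    """
    # L2-normalize so the dot product equals cosine similarity.
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = emb_a @ emb_b.t() / temperature

    # The matching pair for row i sits on the diagonal.
    targets = torch.arange(emb_a.size(0), device=emb_a.device)

    # Average both directions (a->b and b->a) for a symmetric loss.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with placeholder embeddings from any image encoder
# (batch size 8 and dimension 256 are arbitrary choices).
a = torch.randn(8, 256)
b = torch.randn(8, 256)
loss = info_nce_loss(a, b)
```

In practice the two embedding batches would come from an image encoder inside the LMM; the same loss shape applies whether the pairs are augmented views, co-segmented objects, or images linked by a shared caption.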
Papers
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang