Multimodal Vision-Language Models
Multimodal vision-language models (VLMs) integrate visual and textual information, enabling AI systems to understand images and generate text about them. Current research focuses on improving VLM performance in challenging scenarios, such as handling occluded objects, generating diverse and non-generic text, and adapting to low-resource languages, often building on CLIP and other transformer-based architectures. These advances matter because they extend the capabilities of AI systems across applications such as image retrieval, graphic design, and human-robot interaction, while also raising important considerations around privacy and robustness.
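As a minimal sketch of the CLIP-style image-text matching that underlies many of these systems, the example below scores one image against a few candidate captions using the Hugging Face transformers implementation of CLIP. The checkpoint name, example image URL, and caption strings are illustrative assumptions, not details drawn from any particular paper listed here.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly available CLIP checkpoint (illustrative choice).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image (URL is a placeholder; any RGB image works).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate captions to rank against the image.
captions = ["a photo of a cat", "a photo of a dog", "a graphic design mockup"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax gives a ranking.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```

The same similarity scores can be reused for zero-shot retrieval: embed a gallery of images once, then rank them against a text query, which is one of the application patterns mentioned above.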