LLaVA HD
LLaVA (Large Language and Vision Assistant) is a multimodal large language model that connects vision and language processing, with a primary focus on improving image understanding and visually grounded text generation. Current research emphasizes improving LLaVA's performance through techniques such as knowledge graph augmentation, multi-graph alignment algorithms, and knowledge distillation into smaller, faster models. This research is significant because it advances more robust and efficient multimodal models with applications in diverse fields such as medicine, robotics, and education, pushing the boundaries of AI's ability to understand and interact with the world.
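Of the techniques mentioned above, knowledge distillation is the easiest to illustrate concretely. The sketch below shows the standard soft-target distillation loss (in the style of Hinton et al.), not the specific method of any paper listed here; the function and argument names are hypothetical, and only the loss computation is shown.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target (teacher-matching) and hard-target (label) losses.

    student_logits, teacher_logits: (batch, num_classes) raw logits.
    labels: (batch,) ground-truth class indices.
    T: temperature softening both distributions.
    alpha: weight on the distillation term.
    """
    # Soft targets: KL divergence between the student's and teacher's
    # temperature-softened distributions, scaled by T^2 to keep gradient
    # magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The idea is that a smaller student model is trained to match the teacher's softened output distribution while still fitting the true labels; the temperature T controls how much of the teacher's information about near-miss classes is exposed to the student.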
Papers
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng