LLaVA HD
LLaVA (Large Language and Vision Assistant) is a multimodal large language model designed to bridge vision and language processing, with a primary focus on image understanding and generation. Current research aims to improve LLaVA's performance through techniques such as knowledge graph augmentation, multi-graph alignment algorithms, and efficient knowledge distillation that yields smaller, faster models. This work matters because it advances more robust and efficient multimodal models with applications in diverse fields such as medicine, robotics, and education, ultimately pushing the boundaries of AI's ability to understand and interact with the world.
Papers
LLaNA: Large Language and NeRF Assistant
Andrea Amaduzzi, Pierluigi Zama Ramirez, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, Roei Herzig