Generated Caption
Image and video captioning research aims to automatically generate descriptive text summarizing visual content, improving accessibility and enabling new applications across domains. Current efforts focus on enhancing model accuracy and addressing limitations such as bias and hallucination through techniques including improved image-text data alignment, graph-based captioning, and the integration of large language models (LLMs) with various encoder-decoder architectures, including transformers and LSTMs. These advances are driving progress in areas such as remote sensing, medical image analysis, and retail analytics, where automated captioning can make data processing and analysis more efficient. Research is also actively exploring ways to improve caption quality, including caption-length control, sentiment-aware generation, and the incorporation of contextual information.
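The encoder-decoder pattern mentioned above can be sketched in a few lines. The toy model below is an assumption-laden illustration, not any paper's method: random, untrained weights, a five-word vocabulary, and a plain RNN step standing in for an LSTM or transformer decoder. It shows only the data flow: image features are encoded into an initial hidden state, then tokens are generated one at a time by greedy decoding.

```python
import numpy as np

# Hypothetical toy setup: random weights, tiny vocabulary.
rng = np.random.default_rng(0)
VOCAB = ["<start>", "<end>", "a", "dog", "runs"]
EMB, HID, FEAT = 8, 8, 16

W_enc = rng.normal(size=(FEAT, HID))        # encoder: image features -> initial state
E = rng.normal(size=(len(VOCAB), EMB))      # token embeddings
W_h = rng.normal(size=(HID, HID))           # recurrent weight
W_x = rng.normal(size=(EMB, HID))           # input weight
W_out = rng.normal(size=(HID, len(VOCAB))) # hidden state -> vocabulary logits

def caption(image_features, max_len=5):
    """Greedy decoding: feed the previous token back in until <end> or max_len."""
    h = np.tanh(image_features @ W_enc)       # encode the image into the initial state
    token = VOCAB.index("<start>")
    words = []
    for _ in range(max_len):
        h = np.tanh(h @ W_h + E[token] @ W_x) # plain RNN step (an LSTM adds gates)
        token = int(np.argmax(h @ W_out))     # greedy: pick the most likely next token
        if VOCAB[token] == "<end>":
            break
        words.append(VOCAB[token])
    return " ".join(words)

print(caption(rng.normal(size=FEAT)))
```

Real systems replace the random weights with trained ones, the RNN step with an LSTM or transformer (often conditioned on an LLM), and greedy decoding with beam search or sampling; the encode-then-decode loop itself is the shared skeleton.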
Papers
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection
Rui Cao, Ming Shan Hee, Adriel Kuek, Wen-Haw Chong, Roy Ka-Wei Lee, Jing Jiang