Generated Caption
Image and video captioning research aims to automatically generate descriptive text summarizing visual content, improving accessibility and enabling new applications across diverse fields. Current efforts focus on improving model accuracy and addressing limitations such as bias and hallucination, using techniques that include better image-text data alignment, graph-based captioning, and the integration of large language models (LLMs) with encoder-decoder architectures built on transformers or LSTMs. These advances are driving progress in areas such as remote sensing, medical image analysis, and retail analytics, where automated captioning can make data processing and analysis more efficient. Research is also actively exploring ways to improve caption quality, including length control, sentiment analysis, and the incorporation of contextual information.
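To make the decoding side of these encoder-decoder captioners concrete, the following is a minimal sketch of greedy caption decoding with a length cap (one simple form of the length control mentioned above). The `next_token_scores` function is a hypothetical stand-in for a trained decoder; a real model would return learned logits conditioned on the encoded image.

```python
# Minimal sketch: greedy caption decoding with a maximum-length cap.
# VOCAB and next_token_scores are hypothetical stand-ins for a trained model.

VOCAB = ["<bos>", "<eos>", "a", "dog", "runs", "on", "grass"]

def next_token_scores(tokens):
    # Hypothetical decoder: deterministically walks through one fixed
    # caption, then emits <eos>. A real captioner would return logits
    # conditioned on the image features and the tokens generated so far.
    caption = ["a", "dog", "runs", "on", "grass", "<eos>"]
    step = len(tokens) - 1  # number of tokens generated after <bos>
    target = caption[step] if step < len(caption) else "<eos>"
    return {w: (1.0 if w == target else 0.0) for w in VOCAB}

def greedy_decode(max_len=16):
    # Repeatedly pick the highest-scoring next token; stop at <eos>
    # or when the length cap is reached (simple length control).
    tokens = ["<bos>"]
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return " ".join(tokens[1:])

print(greedy_decode())           # → a dog runs on grass
print(greedy_decode(max_len=4))  # capped early → a dog runs
```

In practice, greedy decoding is often replaced by beam search or sampling, and length control can instead be imposed through length penalties or learned length embeddings rather than a hard cap.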
Papers
Transparent Human Evaluation for Image Captioning
Jungo Kasai, Keisuke Sakaguchi, Lavinia Dunagan, Jacob Morrison, Ronan Le Bras, Yejin Choi, Noah A. Smith
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, Zheng-Jun Zha