Generated Caption
Image and video captioning research aims to automatically generate descriptive text that summarizes visual content, improving accessibility (for example, for visually impaired users) and enabling new applications across diverse fields. Current efforts focus on improving model accuracy and addressing limitations such as bias and hallucination through techniques including better vision-language data alignment, graph-based captioning, and the integration of large language models (LLMs) with encoder-decoder architectures built on transformers or LSTMs. These advances are driving progress in areas such as remote sensing, medical image analysis, and retail analytics, where automated captioning enables more efficient data processing and analysis. Research is also actively exploring ways to improve caption quality, including caption length control, sentiment-aware generation, and the incorporation of contextual information.
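The encoder-decoder pattern mentioned above can be illustrated with a toy sketch: an "encoder" compresses the image into a feature, and a "decoder" greedily emits one token at a time conditioned on that feature and the previous token. Everything here (the vocabulary, the hand-written scoring rule, the function names) is a hypothetical stand-in for illustration only; real systems use a trained visual encoder (e.g. a vision transformer) and a learned language decoder.

```python
# Toy sketch of the encoder-decoder captioning loop (illustrative only;
# the scoring rule below is hand-written, not learned).

VOCAB = ["<bos>", "a", "dog", "cat", "running", "sleeping", "<eos>"]

def encode_image(pixels):
    """Stand-in 'encoder': reduce raw pixel values to one feature."""
    return sum(pixels) / len(pixels)

def decoder_scores(feature, prev_token):
    """Stand-in 'decoder': score each vocabulary token given the image
    feature and the previous token. A real decoder would be a trained
    transformer or LSTM conditioned on the full token prefix."""
    scores = {tok: 0.0 for tok in VOCAB}
    if prev_token == "<bos>":
        scores["a"] = 1.0
    elif prev_token == "a":
        # Pick the noun from the image feature.
        scores["dog" if feature > 0.5 else "cat"] = 1.0
    elif prev_token in ("dog", "cat"):
        scores["running" if feature > 0.5 else "sleeping"] = 1.0
    else:
        scores["<eos>"] = 1.0
    return scores

def generate_caption(pixels, max_len=10):
    """Greedy decoding: repeatedly take the highest-scoring next token."""
    feature = encode_image(pixels)
    tokens = ["<bos>"]
    for _ in range(max_len):
        nxt = max(decoder_scores(feature, tokens[-1]),
                  key=lambda t: decoder_scores(feature, tokens[-1])[t])
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])

print(generate_caption([0.9, 0.8, 0.7]))  # → a dog running
print(generate_caption([0.1, 0.2, 0.3]))  # → a cat sleeping
```

The same greedy loop structure underlies LLM-integrated captioners; they differ in replacing the hand-written scorer with a learned model and often in swapping greedy decoding for beam search or sampling.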
Papers
Decoding fMRI Data into Captions using Prefix Language Modeling
Vyacheslav Shen, Kassymzhomart Kunanbayev, Dae-Shik Kim
Multi-LLM Collaborative Caption Generation in Scientific Documents
Jaeyoung Kim, Jongho Lee, Hong-Jun Choi, Ting-Yao Hsu, Chieh-Yang Huang, Sungchul Kim, Ryan Rossi, Tong Yu, Clyde Lee Giles, Ting-Hao 'Kenneth' Huang, Sungchul Choi