Contrastive Captioners

Contrastive captioning generates text that describes multimedia data (images, audio, video) both accurately and distinctively, using contrastive learning techniques. Current research emphasizes three directions: improving temporal understanding of audio and video, strengthening multimodal alignment between text and visual or audio features through architectures such as transformers, and incorporating large language models for caption generation and evaluation. The resulting captions are more robust and informative, with applications ranging from image and video retrieval to the evaluation of existing captioning models and advances in areas such as reverse engineering.
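To make the contrastive part concrete, here is a minimal sketch of a symmetric InfoNCE-style loss, one common way to align media and text embeddings. The summary above does not name a specific objective, so the loss form, the function name `contrastive_loss`, and the temperature value of 0.07 are all assumptions chosen for illustration. A captioner trained this way would typically combine such a term with a separate caption-generation loss, which is not shown here.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(media_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Illustrative assumption: media_embs[i] and text_embs[i] form the
    matching (positive) pair; every other pairing in the batch is
    treated as a negative.
    """
    media = [l2_normalize(v) for v in media_embs]
    text = [l2_normalize(v) for v in text_embs]
    n = len(media)

    # Cosine similarity between every media item and every caption,
    # scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(m, t)) / temperature for t in text]
        for m in media
    ]

    # Media-to-text direction: each row should peak on its own caption.
    media_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-media direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_media = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (media_to_text + text_to_media)


# Toy embeddings: correctly paired captions versus swapped captions.
media = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(media, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = contrastive_loss(media, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

The loss is low when each media item is most similar to its own caption and high when captions are swapped, which is the property that pushes captions toward being distinctive as well as accurate.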

Papers