Captioning Datasets
Image captioning datasets are crucial for training and evaluating models that generate textual descriptions of images, aiming to bridge the gap between computer vision and natural language processing. Current research focuses on improving dataset quality by addressing noise and bias in existing datasets, developing more robust evaluation metrics, and exploring novel training strategies like self-supervised learning and contrastive methods, often employing transformer-based architectures. These advancements are vital for enhancing the accuracy and fluency of generated captions, with implications for applications ranging from image retrieval and accessibility tools to content creation and analysis across diverse domains.
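As one concrete illustration of the contrastive training signal mentioned above, the sketch below computes a CLIP-style symmetric InfoNCE loss over a batch of matched image-caption embeddings. This is a generic, hedged illustration: the function name, placeholder feature tensors, and dimensions are assumptions for exposition, not the specific objective of any paper listed here.

```python
# Illustrative sketch of a CLIP-style contrastive image-text objective,
# the kind of training signal often paired with transformer encoders.
# Names and dimensions are placeholders, not taken from the listed papers.
import torch
import torch.nn.functional as F


def contrastive_caption_loss(image_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/caption pairs."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature

    # The matched caption for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Example with random features standing in for transformer encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 512
    image_features = torch.randn(batch, dim)
    caption_features = torch.randn(batch, dim)
    print(contrastive_caption_loss(image_features, caption_features))
```

In practice the two feature tensors would come from an image encoder and a text encoder trained jointly, so that matched image-caption pairs score higher than mismatched ones within each batch.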
Papers
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin, Stephen Rawls, David Chan, Shalini Ghosh, Anna Rumshisky, Wael Hamza
Cross-Domain Image Captioning with Discriminative Finetuning
Roberto Dessì, Michele Bevilacqua, Eleonora Gualdoni, Nathanael Carraz Rakotonirina, Francesca Franzon, Marco Baroni