Captioning Metric

Captioning metrics evaluate the quality of automatically generated image or video descriptions, with the goal of aligning automated scores with human judgment. Recent research focuses on reference-free metrics that leverage large multimodal models such as CLIP to compare a generated caption directly against its corresponding visual content, often adding hierarchical or fine-grained comparisons to improve accuracy and interpretability. These approaches address limitations of traditional reference-based metrics, which depend on scarce human-annotated reference captions and may miss the nuances of modern, highly detailed captioning models; the result is better evaluation and, in turn, better development of image and video captioning systems.
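The reference-free idea described above can be made concrete with CLIPScore (Hessel et al., 2021), which scores a caption as a rescaled cosine similarity between CLIP embeddings of the image and the caption: `w * max(cos(E_img, E_cap), 0)` with `w = 2.5`. The sketch below assumes precomputed embeddings and uses synthetic vectors for illustration; in practice the embeddings would come from a CLIP encoder (e.g. via `open_clip` or `transformers`).

```python
import numpy as np

def clipscore(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """Reference-free CLIPScore: w * max(cosine(image, caption), 0).

    Both arguments are CLIP embeddings of the image and the candidate
    caption; no human-written reference captions are needed.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    return float(w * max(np.dot(image_emb, caption_emb), 0.0))

# Synthetic stand-ins for CLIP embeddings (hypothetical, for illustration only).
rng = np.random.default_rng(0)
img = rng.normal(size=512)
good = img + 0.3 * rng.normal(size=512)   # caption embedding close to the image
bad = rng.normal(size=512)                # unrelated caption embedding

print(clipscore(img, good) > clipscore(img, bad))  # → True
```

The `max(..., 0)` clamp and the rescaling constant `w` keep scores in a roughly [0, 2.5] range; reference-augmented variants (e.g. RefCLIPScore) combine this with similarity to reference captions when they are available.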

Papers