Cross-Modal Retrieval Benchmarks

Cross-modal retrieval benchmarks evaluate how well systems retrieve relevant information across modalities, such as matching images to text. Current research focuses on improving retrieval accuracy, particularly for long texts and 3D data, often pairing large language models (LLMs) and vision-language models (VLMs) with techniques such as contrastive learning and optimal transport to refine cross-modal alignments and cope with noisy or mismatched data. These advances underpin robust, versatile multimodal search engines and information retrieval systems, with applications ranging from multimedia search to augmented reality. The development of standardized benchmarks, such as M-BEIR, is also a key focus, enabling fair comparison and measurable progress in the field.
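As a concrete illustration of how such benchmarks are typically scored, the sketch below computes Recall@K for text-to-image retrieval over a paired corpus. This is a minimal sketch under stated assumptions: the embeddings, shapes, and noise model are illustrative stand-ins for the output of a CLIP-style dual encoder, not the protocol of any specific benchmark.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=(1, 5, 10)):
    """Score paired cross-modal retrieval with Recall@K.

    Assumes row i of `query_emb` (e.g. text) is the ground-truth match
    for row i of `gallery_emb` (e.g. images), as in COCO-style setups.
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T  # (num_queries, num_gallery) similarity matrix

    # Rank of each true match: count gallery items scoring at least as high.
    true_scores = np.diag(sim)
    ranks = (sim >= true_scores[:, None]).sum(axis=1)  # rank 1 = perfect
    return {f"R@{kk}": float((ranks <= kk).mean()) for kk in k}

# Illustrative run on random vectors standing in for encoder outputs;
# the added noise simulates imperfectly aligned image-text pairs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(1000, 512))
image_emb = text_emb + 0.5 * rng.normal(size=(1000, 512))
print(recall_at_k(text_emb, image_emb))
```

Standardized benchmarks in this area typically report the same recall-style metrics, but over much larger and more heterogeneous candidate pools, which is where approximate nearest-neighbor indexing becomes necessary in practice.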

Papers