Paper ID: 2304.02278
SCMM: Calibrating Cross-modal Representations for Text-Based Person Search
Jing Liu, Donglai Wei, Yang Liu, Sipeng Zhang, Tong Yang, Victor C.M. Leung
Text-Based Person Search (TBPS) is a crucial task that enables accurate retrieval of target individuals from large-scale galleries given only a textual caption. For cross-modal TBPS tasks, it is critical to obtain well-distributed representations in the common embedding space to reduce the inter-modal gap. Furthermore, learning detailed image-text correspondences is essential to discriminate similar targets and enable fine-grained search. To address these challenges, we present a simple yet effective method named Sew Calibration and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings. SCMM is distinguished by two novel losses that provide fine-grained cross-modal representations: 1) a Sew calibration loss that takes the quality of textual captions as guidance and aligns features between the image and text modalities, and 2) a Masked Caption Modeling (MCM) loss that leverages a masked caption prediction task to establish detailed and generic relationships between textual and visual parts. This dual-pronged strategy refines feature alignment and enriches cross-modal correspondences, enabling the accurate distinction of similar individuals. Moreover, the streamlined dual-encoder architecture avoids complex branches and interaction modules, enabling the high-speed inference required by resource-limited applications that often demand real-time processing. Extensive experiments on three popular TBPS benchmarks demonstrate the superiority of SCMM, achieving top results with 73.81%, 74.25%, and 57.35% Rank-1 accuracy on CUHK-PEDES, ICFG-PEDES, and RSTPReID, respectively. We hope SCMM's scalable and cost-effective design will serve as a strong baseline and facilitate future research in this field.
Submitted: Apr 5, 2023
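
As a rough illustration of what a masked-caption-modeling objective can look like, the sketch below masks a fraction of caption tokens and predicts them from image features via cross-attention. This is not the paper's implementation: the module name MaskedCaptionHead, the 15% mask ratio, the single cross-attention fusion layer, and the tensor shapes are all assumptions chosen for brevity.

```python
# Hypothetical sketch of an MCM-style loss; names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class MaskedCaptionHead(nn.Module):
    """Predicts masked caption tokens from caption and image embeddings (illustrative only)."""
    def __init__(self, dim=512, vocab_size=30522, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learned [MASK] embedding substituted at masked caption positions.
        self.mask_token = nn.Parameter(torch.zeros(dim))
        # Single cross-attention layer fusing caption queries with visual keys/values.
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, vocab_size)
        self.criterion = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, text_emb, image_emb, token_ids):
        # text_emb: (B, L, dim), image_emb: (B, N, dim), token_ids: (B, L) integer ids.
        B, L, _ = text_emb.shape
        masked = torch.rand(B, L, device=text_emb.device) < self.mask_ratio
        # Supervise only the masked positions; -100 is ignored by the loss.
        labels = token_ids.masked_fill(~masked, -100)
        # Replace masked caption positions with the learned mask embedding.
        x = torch.where(masked.unsqueeze(-1), self.mask_token.expand_as(text_emb), text_emb)
        # Let (partially masked) caption tokens attend to visual tokens, then classify.
        fused, _ = self.fuse(query=x, key=image_emb, value=image_emb)
        logits = self.classifier(fused)
        return self.criterion(logits.flatten(0, 1), labels.flatten())

# Example usage with random tensors (shapes are illustrative):
head = MaskedCaptionHead()
loss = head(torch.randn(2, 20, 512), torch.randn(2, 50, 512),
            torch.randint(0, 30522, (2, 20)))
```

The design choice illustrated here is the one the abstract emphasizes: the masked-prediction head is an auxiliary training objective on top of a dual-encoder, so it can be dropped at inference time and retrieval remains a fast embedding comparison.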