Paper ID: 2304.05303

ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity

Sumin Seo, JaeWoong Shin, Jaewoo Kang, Tae Soo Kim, Thijs Kooi

Deep learning has shown great potential in assisting radiologists in reading chest X-ray (CXR) images, but its reliance on expensive annotations to improve performance prevents widespread clinical application. Vision language pre-training (VLP) can alleviate the burden and cost of annotation by leveraging routinely generated radiograph reports, which exist in large quantities and are already paired with images (image-text pairs). In addition, localization-aware extensions of VLP are being proposed to address the need for accurate localization of abnormalities in computer-aided diagnosis (CAD) for CXR. However, we find that the formulation proposed in the locality-aware VLP literature actually leads to a loss of the spatial relationships required for downstream localization tasks. We therefore propose Empowering Locality of VLP with Intra-modal Similarity (ELVIS), a VLP method aware of intra-modal locality that better preserves the locality within radiographs or reports, which enhances the ability to comprehend location references in text reports. Our locality-aware VLP method significantly outperforms state-of-the-art baselines in multiple segmentation tasks and the MS-CXR phrase grounding task. Qualitatively, we show that ELVIS focuses well on regions of interest described in the report text compared to prior approaches, allowing for enhanced interpretability.

Submitted: Apr 11, 2023