Duplicate Detection

Duplicate detection aims to identify identical or near-identical items across diverse data types, ranging from text and images to software code and medical scans. Current research focuses on robust algorithms and models, including Siamese networks, transformers, and locality-sensitive hashing, that handle multiple data modalities and address challenges such as fuzzy duplicates and near-duplicates produced by subtle transformations (cropping, paraphrasing, or code refactoring, for example). These advances are crucial for improving data quality, speeding up search, protecting intellectual property, and automating tasks in fields ranging from software engineering and customer relationship management to medical imaging and copyright enforcement.
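
To make the learned-similarity approach concrete, the sketch below shows the basic Siamese pattern in PyTorch: one shared encoder embeds both items, and the cosine similarity of the embeddings serves as a duplicate score. The architecture, dimensions, and threshold are illustrative assumptions, not taken from any particular paper.

```python
# A minimal Siamese similarity scorer; all sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """One shared encoder applied to both items; duplicates should map close together."""
    def __init__(self, in_dim=256, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings

def pair_similarity(encoder, a, b):
    """Cosine similarity of the two embeddings; values near 1.0 suggest a duplicate."""
    return (encoder(a) * encoder(b)).sum(dim=-1)

# Training typically uses a contrastive or triplet loss so duplicate pairs score
# high and distinct pairs score low; at inference, thresholding the similarity
# (e.g. > 0.9, an assumed cutoff) flags candidate duplicates.
encoder = SiameseEncoder()
a, b = torch.randn(1, 256), torch.randn(1, 256)
print(pair_similarity(encoder, a, b).item())
```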

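On the hashing side, MinHash is a standard locality-sensitive hashing scheme for near-duplicate text: it fingerprints each document's shingle set so that the fraction of matching signature entries estimates Jaccard similarity. The shingle size, hash count, and helper names below are illustrative choices, not drawn from the papers listed here.

```python
# A minimal pure-stdlib MinHash sketch for near-duplicate text detection.
import hashlib
import random

NUM_HASHES = 128  # more hash functions -> tighter similarity estimate

def shingles(text, k=5):
    """Character k-grams ("shingles") of a whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(items, num_hashes=NUM_HASHES, seed=42):
    """Keep the minimum of each salted hash; each acts like a random permutation."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        )
        for salt in salts
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima estimates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "The quick brown fox jumps over the lazy dog."
b = "The quick brown fox jumped over a lazy dog!"
sig_a = minhash_signature(shingles(a))
sig_b = minhash_signature(shingles(b))
print(f"estimated Jaccard similarity: {estimated_jaccard(sig_a, sig_b):.2f}")
```

In practice the signatures are further split into bands so that only documents sharing a band hash are compared, which is what lets this approach scale to large corpora without pairwise comparison.
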
Papers