Near Duplicate

Near-duplicate detection identifies items that are highly similar but not identical, whether images, videos, text, or code, in large datasets. Current research develops algorithms and model architectures, such as Siamese networks and vision transformers, that capture semantic similarity beyond exact matches, often combined with techniques like embedding refinement and graph-theoretic approaches. The field matters for managing large datasets, mitigating copyright infringement, improving search and recommendation systems, and ensuring fair evaluation in machine learning (for example, by keeping near-duplicate samples out of both the training and test splits), with applications ranging from biometric security to software development and online learning platforms.
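
As a rough illustration of how embedding similarity and a graph-theoretic step can fit together, the sketch below links items whose embeddings exceed a cosine-similarity threshold and returns the connected components of that similarity graph as near-duplicate groups. It is not taken from any of the papers listed here: the function names, the toy vectors, and the 0.95 threshold are all assumptions chosen for illustration. In practice the embeddings would come from a trained model such as a Siamese network.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def near_duplicate_groups(embeddings, threshold=0.95):
    """Group item indices linked by cosine similarity >= threshold.

    Treats each qualifying pair as an edge and returns the connected
    components of that graph, found with union-find. The pairwise loop
    is O(n^2), so this is only suitable for small inputs.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two components

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical toy embeddings: items 0 and 1 point in nearly the same
# direction, as do items 2 and 3; item 4 is unrelated to the rest.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],
    [0.0, 0.0, 1.0],
]
groups = near_duplicate_groups(toy_embeddings)
print(groups)  # [[0, 1], [2, 3], [4]]
```

Using connected components means similarity is applied transitively: if A is close to B and B is close to C, all three land in one group even when A and C fall below the threshold. Whether that is desirable depends on the application, and at larger scale the exhaustive pairwise comparison would typically be replaced by an approximate nearest-neighbour search.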

Papers

March 16, 2023

SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, Ari S. Morcos
Language Model Data Efficient Near Duplicate Data Deduplication Quality Aware Web Scale

December 20, 2022

Unsupervised Question Duplicate and Related Questions Detection in e-learning platforms
Maksimjeet Chowdhary, Sanyam Goyal, Venktesh V, Mukesh Mohania, Vikram Goyal
High Similarity Diverse Platform Duplicate Detection Near Duplicate

November 20, 2022

Semantic Similarity-Based Clustering of Findings From Security Testing Tools
Phillip Schneider, Markus Voggenreiter, Abdullah Gulraiz, Florian Matthes
Semantic Description Intriguing Finding Near Duplicate Software Security Security Testing

October 4, 2022

Mining Duplicate Questions of Stack Overflow
Mihir Kale, Anirudha Rayasam, Radhika Parik, Pranav Dheram
Neural Network Near Duplicate Stack Overflow Community Question Answering

September 18, 2022

Evolution of a Web-Scale Near Duplicate Image Detection System
Andrey Gusev, Jiajing Xu
Specie Evolution Near Duplicate Human Labeled Medium Ecosystem Image Corpus

May 12, 2022

3D Moments from Near-Duplicate Photos
Qianqian Wang, Zhengqi Li, David Salesin, Noah Snavely, Brian Curless, Janne Kontkanen
Temporal Moment Near Duplicate Complex Dynamic Scene Computational Photography Motion Interpolation

May 9, 2022

Sub-Word Alignment Is Still Useful: A Vest-Pocket Method for Enhancing Low-Resource Machine Translation
Minhan Xu, Yu Hong
Alignment Problem Sub Word Near Duplicate Translation Based Low Resource Machine Translation

April 27, 2022

Beyond Duplicates: Towards Understanding and Predicting Link Types in Issue Tracking Systems
Clara Marie Lüders, Abir Bouraffa, Walid Maalej
Human Understanding Link Prediction Near Duplicate Duplicate Detection Issue Tracking System

February 12, 2022

Classification of Microscopy Images of Breast Tissue: Region Duplication based Self-Supervision vs. Off-the Shelf Deep Representations
Aravind Ravi
Self Supervised Learning Supervised ImageNet Deep Network Self Supervision Microscopy Image Deep Feature Deep Representation Near Duplicate Breast Tissue

February 9, 2022

Allocating Duplicate Copies for IoT Data in Cloud Computing Based on Harmony Search Algorithm
Younes Jahandideh, A. Mirzaei
Cloud Computing Serial Reproduction Near Duplicate Internet of Thing Data Cloud Environment Data Analysis Replication Harmony Search

November 30, 2021

Mitigating Adversarial Attacks by Distributing Different Copies to Different Users
Jiyi Zhang, Han Fang, Wesley Joon-Wie Tann, Ke Xu, Chengfang Fang, Ee-Chien Chang
Adversarial Attack Adversarial Sample Inherent Randomness Near Duplicate Multi User Adversarial Region
