Paper ID: 2406.15483

Duplicate Detection with GenAI

Ian Ormesher

Customer data is often stored as records in Customer Relations Management systems (CRMs). Data which is manually entered into such systems by one of more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasing complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. In this paper we show how using the latest advancements in Large Language Models and Generative AI can vastly improve the identification and repair of duplicated records. On common benchmark datasets we find an improvement in the accuracy of data de-duplication rates from 30 percent using NLP techniques to almost 60 percent using our proposed method.

Submitted: Jun 17, 2024