Paper ID: 2406.15483
Duplicate Detection with GenAI
Ian Ormesher
Customer data is often stored as records in Customer Relations Management systems (CRMs). Data which is manually entered into such systems by one of more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasing complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. In this paper we show how using the latest advancements in Large Language Models and Generative AI can vastly improve the identification and repair of duplicated records. On common benchmark datasets we find an improvement in the accuracy of data de-duplication rates from 30 percent using NLP techniques to almost 60 percent using our proposed method.
Submitted: Jun 17, 2024