Paper ID: 2406.13695
Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models
Stefan Pasch, Dimitirios Petridis, Jannic Cutura
This paper addresses the deduplication of multilingual textual data using advanced NLP tools. We compare a two-step method, which translates text to English and then embeds it with mpnet, against a multilingual embedding model (distiluse). The two-step approach achieved a higher F1 score (82% vs. 60%), particularly for less widely used languages, and its F1 score can be increased to as much as 89% by leveraging expert rules based on domain knowledge. We also highlight limitations related to token length constraints and computational efficiency. Our methodology suggests improvements for future multilingual deduplication tasks.
Submitted: Jun 19, 2024
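
The abstract describes embedding-based deduplication evaluated by F1 score. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes the embedding step (translation plus mpnet, or distiluse) has already produced one vector per text. The toy vectors, the 0.9 threshold, and the function names `find_duplicate_pairs` and `f1_score` are all hypothetical. It also swaps in a brute-force pairwise comparison where the paper applies scalable similarity search.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_duplicate_pairs(embeddings, threshold=0.9):
    """Return index pairs (i, j) with i < j whose cosine similarity is >= threshold.

    Brute force, O(n^2): fine for illustration, not for large corpora.
    The threshold value is an assumption, not taken from the paper.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(embeddings)), 2)
        if cosine(embeddings[i], embeddings[j]) >= threshold
    }


def f1_score(predicted, actual):
    """F1 over sets of predicted and labelled duplicate pairs."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Toy vectors standing in for model embeddings (hypothetical data).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.98, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.95, 0.1],
]
labelled_duplicates = {(0, 1), (2, 3)}

predicted = find_duplicate_pairs(toy_embeddings, threshold=0.9)
print(sorted(predicted))
print(f1_score(predicted, labelled_duplicates))
```

In a real pipeline the toy vectors would be replaced by the output of the chosen embedding model, and the pairwise loop by an approximate nearest-neighbour index, since comparing every pair does not scale.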