Noisy Corpus
Noisy corpora are datasets that contain inaccuracies or inconsistencies, and they pose significant challenges for machine learning models, particularly in speech recognition and natural language processing. Current research focuses on robust methods for handling this noise, including explicit denoising in retrieval-augmented generation (RAG) and data augmentation strategies tailored to specific data characteristics (e.g., children's speech). These advances are important for improving the accuracy and reliability of applications ranging from speech-to-text systems to information retrieval and text de-duplication.
Papers
September 13, 2024
June 19, 2024
June 6, 2024
February 23, 2024
March 13, 2023
October 9, 2022