Noisy Corpus

Noisy corpora are datasets that contain inaccuracies or inconsistencies, and they pose significant challenges for machine learning models, particularly in speech recognition and natural language processing. Current research focuses on robust methods for handling this noise, including explicit denoising in retrieval-augmented generation (RAG) and data augmentation strategies tailored to specific data characteristics (e.g., children's speech). These methods aim to improve the accuracy and reliability of applications ranging from speech-to-text systems to information retrieval and text de-duplication.

Papers

September 13, 2024

Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech
Pan-Pan Jiang, Jimmy Tobin, Katrin Tomanek, Robert L. MacDonald, Katie Seaver, Richard Cave, Marilyn Ladewig, Rus Heywood, Jordan R. Green
Data Set Automatic Speech Recognition Speech Pattern Noisy Corpus

June 19, 2024

InstructRAG: Instructing Retrieval-Augmented Generation with Explicit Denoising
Zhepei Wei, Wei-Lin Chen, Yu Meng
Language Model Retrieval Augmented Generation Implicit Denoising Noisy Corpus

June 6, 2024

InaGVAD : a Challenging French TV and Radio Corpus Annotated for Speech Activity Detection and Speaker Gender Segmentation
David Doukhan, Christine Maertens, William Le Personnic, Ludovic Speroni, Reda Dehak
Speech Corpus Voice Activity Detection Speaker Characteristic Event Type Noisy Corpus

February 23, 2024

ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification
Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen
Data Augmentation Speaker Verification Low Resource Speech Data Data Augmentation Method Child Speech Noisy Corpus

March 13, 2023

A Human Subject Study of Named Entity Recognition (NER) in Conversational Music Recommendation Queries
Elena V. Epure, Romain Hennequin
Entity Recognition Music Recommendation Human Subject Class Incremental NER Noisy Corpus

October 9, 2022

Noise-Robust De-Duplication at Scale
Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell
Large Corpus Visual Analogue Scale Data Deduplication Noisy Corpus

March 29, 2022

DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning
Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto
Text to Speech Speech Synthesis Utterance Representation Noisy Corpus
