Acoustic Word Embeddings

Acoustic word embeddings (AWEs) are fixed-length vector representations of variable-duration spoken word segments, aiming to capture both phonetic and semantic information for improved speech processing. Current research focuses on enhancing AWE models with self-supervised speech representations (e.g., HuBERT, wav2vec 2.0), multi-view learning that combines acoustic and textual data, and deep metric learning objectives such as proxy losses. These advances are improving performance in diverse applications, including keyword spotting, speech emotion recognition, and low-resource language processing, by enabling more accurate and efficient comparison of spoken-word segments.
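
As a minimal sketch of the basic idea, the snippet below mean-pools self-supervised wav2vec 2.0 frame features over a word segment to obtain a fixed-length embedding, then compares two segments by cosine similarity as in query-by-example keyword search. The checkpoint name, pooling strategy, and random placeholder audio are illustrative assumptions, not the method of any particular paper.

```python
# Sketch: deriving an acoustic word embedding by mean-pooling
# self-supervised wav2vec 2.0 features over a spoken-word segment.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-base"  # assumed checkpoint; HuBERT models work analogously
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME)
model.eval()


def acoustic_word_embedding(waveform: torch.Tensor, sample_rate: int = 16_000) -> torch.Tensor:
    """Map a variable-length spoken-word segment to one fixed-length vector."""
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        frames = model(**inputs).last_hidden_state  # shape: (1, num_frames, hidden_dim)
    return frames.mean(dim=1).squeeze(0)            # mean-pool over frames -> (hidden_dim,)


# Hypothetical usage: random tensors stand in for two real word segments.
word_a = torch.randn(16_000)  # ~1 s of 16 kHz audio
word_b = torch.randn(16_000)
emb_a = acoustic_word_embedding(word_a)
emb_b = acoustic_word_embedding(word_b)
similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```

In practice, the pooling model is typically fine-tuned with a metric-learning objective (e.g., triplet or proxy losses) so that embeddings of the same word cluster together while different words are pushed apart.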

Papers