Utterance Level
Utterance-level analysis in speech processing models the information carried by an individual spoken turn as a whole, rather than by isolated frame-level acoustic features. Current research emphasizes learning disentangled representations that separate semantic content from speaker characteristics and other utterance-level attributes, often using transformer-based models, contrastive learning, and variational autoencoders. These representations support applications such as speech recognition, text-to-speech synthesis, emotion recognition, and dialogue systems by enabling more nuanced, context-aware processing of conversational data. Robust and efficient utterance-level methods are in turn driving advances in human-computer interaction and natural language understanding.
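To make the contrast between frame-level and utterance-level features concrete, the following is a minimal sketch of statistics pooling, one common way to collapse a variable-length sequence of frame vectors into a single fixed-size utterance vector. It is an illustration only and is not taken from any of the papers listed here; the function name `stats_pool` and the toy feature values are assumptions made for the example.

```python
import math
from typing import List, Sequence


def stats_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Pool frame-level features into one utterance-level vector.

    `frames` is a (num_frames x feat_dim) sequence. The result is the
    per-dimension mean followed by the per-dimension (population)
    standard deviation, so its length is 2 * feat_dim regardless of
    how many frames the utterance contains.
    """
    if not frames:
        raise ValueError("utterance has no frames")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Hypothetical toy utterance: 3 frames, 2 features per frame.
utterance = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
embedding = stats_pool(utterance)
```

Here `embedding` has four entries (two means, two standard deviations) no matter how long the utterance is. Learned approaches such as the transformer-based or variational models mentioned above replace this fixed pooling with trainable components, but they serve the same purpose of producing one representation per utterance.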
Papers
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance
Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi
Fine-grained Noise Control for Multispeaker Speech Synthesis
Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis