Paper ID: 2503.08533 • Published Mar 11, 2025
ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
Advancements in audio foundation models (FMs) have fueled interest in
end-to-end (E2E) spoken dialogue systems, but the different web interfaces for each
system make it challenging to compare and contrast them effectively. Motivated
by this, we introduce an open-source, user-friendly toolkit designed to build
unified web interfaces for various cascaded and E2E spoken dialogue systems.
Our demo further provides users with the option to get on-the-fly automated
evaluation metrics such as (1) latency, (2) ability to understand user input,
(3) coherence, diversity, and relevance of system response, and (4)
intelligibility and audio quality of system output. Using the evaluation
metrics, we compare various cascaded and E2E spoken dialogue systems with a
human-human conversation dataset as a proxy. Our analysis demonstrates that the
toolkit allows researchers to effortlessly compare and contrast different
technologies, providing valuable insights such as current E2E systems having
poorer audio quality and less diverse responses. An example demo produced using
our toolkit is publicly available here:
this https URL
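
As a rough illustration of what an automated response-diversity score might look like, the sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a set of system responses. This is an assumption made for illustration only: the abstract does not state which diversity metric the toolkit uses, and the function name `distinct_n` and the whitespace tokenization are hypothetical choices, not part of ESPnet-SDS.

```python
from typing import Iterable


def distinct_n(responses: Iterable[str], n: int = 2) -> float:
    """Return the fraction of unique n-grams across all responses.

    Hypothetical helper: a common diversity proxy, not necessarily
    the metric implemented in the toolkit described above.
    """
    total = 0
    unique = set()
    for text in responses:
        # Naive lowercase whitespace tokenization (an assumption).
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # No n-grams at all (empty input or responses shorter than n).
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    replies = ["sure I can help", "sure I can do that"]
    print(distinct_n(replies, n=2))
```

Under this proxy, a system that repeats the same phrasing across turns scores closer to 0, while one with varied wording scores closer to 1.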