Paper ID: 2410.04704

Modeling and Estimation of Vocal Tract and Glottal Source Parameters Using ARMAX-LF Model

Kai Lia, Masato Akagia, Yongwei Lib, Masashi Unokia

Modeling and estimation of the vocal tract and glottal source parameters of vowels from raw speech can be typically done by using the Auto-Regressive with eXogenous input (ARX) model and Liljencrants-Fant (LF) model with an iteration-based estimation approach. However, the all-pole autoregressive model in the modeling of vocal tract filters cannot provide the locations of anti-formants (zeros), which increases the estimation errors in certain classes of speech sounds, such as nasal, fricative, and stop consonants. In this paper, we propose the Auto-Regressive Moving Average eXogenous with LF (ARMAX-LF) model to extend the ARX-LF model to a wider variety of speech sounds, including vowels and nasalized consonants. The LF model represents the glottal source derivative as a parametrized time-domain model, and the ARMAX model represents the vocal tract as a pole-zero filter with an additional exogenous LF excitation as input. To estimate multiple parameters with fewer errors, we first utilize the powerful nonlinear fitting ability of deep neural networks (DNNs) to build a mapping from extracted glottal source derivatives or speech waveforms to corresponding LF parameters. Then, glottal source and vocal tract parameters can be estimated with fewer estimation errors and without any iterations as in the analysis-by-synthesis strategy. Experimental results with synthesized speech using the linear source-filter model, synthesized speech using the physical model, and real speech signals showed that the proposed ARMAX-LF model with a DNN-based estimation method can estimate the parameters of both vowels and nasalized sounds with fewer errors and estimation time.

Submitted: Oct 7, 2024