Paper ID: 2211.06493

Handling Trade-Offs in Speech Separation with Sparsely-Gated Mixture of Experts

Xiaofei Wang, Zhuo Chen, Yu Shi, Jian Wu, Naoyuki Kanda, Takuya Yoshioka

Employing a monaural speech separation (SS) model as a front-end for automatic speech recognition (ASR) involves balancing two kinds of trade-offs. First, while a larger model improves the SS performance, it also incurs a higher computational cost. Second, an SS model that is more heavily optimized for handling overlapped speech is likely to introduce more processing artifacts in non-overlapped-speech regions. In this paper, we address these trade-offs with a sparsely-gated mixture-of-experts (MoE) architecture. Comprehensive evaluation results obtained using both simulated and real meeting recordings show that our proposed sparsely-gated MoE SS model achieves superior separation capability with less speech distortion, while incurring only a marginal increase in run-time cost.
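
The abstract does not spell out the gating mechanism, so the following is only a minimal sketch of a generic sparsely-gated MoE feed-forward layer with top-k routing; the class name SparseMoELayer, the expert count, the top-k value, and all dimensions are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a sparsely-gated mixture-of-experts (MoE) feed-forward layer.
# All hyper-parameters (num_experts, top_k, dimensions) are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    """Routes each input frame to the top-k experts chosen by a learned gate."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, d_model) -> flatten to (batch*frames, d_model) tokens
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.gate(tokens)                           # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep only top-k experts
        weights = F.softmax(weights, dim=-1)                 # renormalize over selected experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            slot_mask = indices == e                         # which top-k slots chose expert e
            token_mask = slot_mask.any(dim=-1)               # tokens routed to expert e
            if not token_mask.any():
                continue                                     # expert unused for this batch
            gate_w = (weights * slot_mask).sum(dim=-1, keepdim=True)  # per-token gate weight
            out[token_mask] += gate_w[token_mask] * expert(tokens[token_mask])

        return out.reshape_as(x)


# Usage: only top_k experts run per frame, so model capacity grows with
# num_experts while per-frame compute stays roughly constant.
layer = SparseMoELayer(d_model=256, d_hidden=1024, num_experts=4, top_k=2)
features = torch.randn(8, 100, 256)                          # (batch, frames, feature dim)
separated = layer(features)
```

This illustrates why the run-time cost increase can stay marginal: the gate activates only a small, fixed number of experts per frame, so adding experts enlarges the model without proportionally increasing the computation performed on each frame.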

Submitted: Nov 11, 2022