Paper ID: 2206.06680

Exploring speaker enrolment for few-shot personalisation in emotional vocalisation prediction

Andreas Triantafyllopoulos, Meishu Song, Zijiang Yang, Xin Jing, Björn W. Schuller

In this work, we explore a novel few-shot personalisation architecture for emotional vocalisation prediction. The core contribution is an `enrolment' encoder which utilises two unlabelled samples of the target speaker to adjust the output of the emotion encoder; the adjustment is based on dot-product attention, thus effectively functioning as a form of `soft' feature selection. The emotion and enrolment encoders are based on two standard audio architectures: CNN14 and CNN10. The two encoders are further guided to forget or learn auxiliary emotion and/or speaker information. Our best approach achieves a CCC of $.650$ on the ExVo Few-Shot dev set, a $2.5\%$ increase over our baseline CNN14 CCC of $.634$.

Submitted: Jun 14, 2022