Contrastive Language Audio
Contrastive Language-Audio Pretraining (CLAP) models learn joint representations of audio and text by training paired encoders so that matching audio clips and captions lie close together in a shared embedding space. Current research focuses on improving CLAP's performance in downstream tasks such as audio source separation, music recommendation, and sound event detection, often addressing data scarcity and the need for reference signals through techniques like retrieval augmentation and prompt tuning. Unlike traditional pipelines trained on a fixed label set, this cross-modal approach enables zero-shot classification and improves the semantic understanding of audio, leading to more robust and versatile audio analytics tools.
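To make the zero-shot classification idea concrete, here is a minimal sketch of how a joint audio-text embedding space can be used to label a clip without task-specific training. It is not taken from any of the papers listed here: the embeddings are hand-written toy vectors standing in for encoder outputs, and the function name `zero_shot_classify` and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import numpy as np

def zero_shot_classify(audio_emb, text_embs, labels, temperature=0.07):
    """Return the label whose text embedding is most similar to the audio embedding.

    audio_emb: shape (d,), a stand-in for an audio encoder output.
    text_embs: shape (n_labels, d), stand-ins for text encoder outputs,
               one row per candidate label prompt.
    temperature: softmax scaling factor (hypothetical value, not from the source).
    """
    # L2-normalize so that the dot product equals cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ a

    # Softmax over temperature-scaled similarities (shifted for numerical stability).
    logits = sims / temperature
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return labels[int(np.argmax(probs))], probs

# Toy vectors only; a real system would obtain these from pretrained encoders.
labels = ["dog barking", "rain", "piano"]
text_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
])
audio_emb = np.array([0.1, 0.9, 0.3])

pred, probs = zero_shot_classify(audio_emb, text_embs, labels)
print(pred, probs.round(3))
```

Because the candidate labels are just text prompts, the label set can be changed at inference time by embedding new prompts, which is what makes the classification "zero-shot".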
Papers
Can We Estimate Purchase Intention Based on Zero-shot Speech Emotion Recognition?
Ryotaro Nagase, Takashi Sumiyoshi, Natsuo Yamashita, Kota Dohi, Yohei Kawaguchi
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
Xiquan Li, Wenxi Chen, Ziyang Ma, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Qiuqiang Kong, Xie Chen