State of the Art Whisper
Whisper, a large-scale multilingual speech recognition model, is the subject of active research aimed at improving its accuracy, efficiency, and robustness across diverse speech characteristics and applications. Current work focuses on adapting Whisper to low-resource languages, improving its streaming capabilities, defending it against adversarial attacks, and combining it with other modalities such as vision for audio-visual speech recognition. These advances matter for healthcare (e.g., aphasia diagnosis), accessibility (e.g., better speech-to-text for people with speech impairments), and security (e.g., defenses against malicious audio manipulation).
Papers
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass
Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection
Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li
Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition
Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian
Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper
Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee
Optimizing Multi-Stuttered Speech Classification: Leveraging Whisper's Encoder for Efficient Parameter Reduction in Automated Assessment
Huma Ameer, Seemab Latif, Iram Tariq Bhatti, Rabia Latif