Speech Input

Speech input processing aims to let computers understand and respond to human speech, bridging the gap between human communication and machine interaction. Current research focuses on making speech recognition more robust and accurate across diverse accents, noise levels, and speaking styles, often using large language models (LLMs) and deep learning architectures such as transformers and convolutional recurrent networks. The field is central to human-computer interaction, with applications ranging from virtual assistants and accessibility tools to multimodal systems that understand both speech and visual information.

Papers

August 22, 2022

DualVoice: Speech Interaction that Discriminates between Normal and Whispered Voice Input
Jun Rekimoto
Speech Recognition Voice Based Speech Input Open Whisper Style Speech Model Speech Interaction Whispered Speech

July 18, 2022

Audio Input Generates Continuous Frames to Synthesize Facial Video Using Generative Adversarial Networks
Hanhaodi Zhang
Generative Adversarial Network Gated Recurrent Unit Face Synthesis Speech Input Consecutive Frame Video to Speech Synthesis

June 17, 2022

Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE
Marc-Antoine Georges, Jean-Luc Schwartz, Thomas Hueber
Acoustic Feature Articulatory Representation Speech Input Articulatory Feature Articulatory Signal

April 12, 2022

ASR in German: A Detailed Error Analysis
Johannes Wirth, Rene Peinl
Automatic Speech Recognition Speech Recognition Error Analysis Speech Input

March 16, 2022

Whither the Priors for (Vocal) Interactivity?
Roger K. Moore
Non Humanoid Robot Human Robot Interaction Interactive No Code SAM Prior Vocal Performance Speech Input Voice Interaction

February 26, 2022

Towards Reducing the Need for Speech Training Data To Build Spoken Language Understanding Systems
Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon
Training Data Speech Data Spoken Language Understanding Community Need Speech Input Text Only Training

February 16, 2022

TalkTive: A Conversational Agent Using Backchannels to Engage Older Adults in Neurocognitive Disorders Screening
Zijian Ding, Jiawen Kang, Tinky Oi Ting HO, Ka Ho Wong, Helene H. Fung, Helen Meng, Xiaojuan Ma
Conversational Agent Cognitive Impairment Older Adult Speech Input Cognitive Test Backchannel Prediction

January 4, 2022

Speech-to-SQL: Towards Speech-driven SQL Query Generation From Natural Language Question
Yuanfeng Song, Raymond Chi-Wing Wong, Xuefang Zhao, Di Jiang
Natural Language Question Speech Input Text to SQL Datasets

December 27, 2021

Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech
Gaoussou Youssouf Kebe, Luke E. Richards, Edward Raff, Francis Ferraro, Cynthia Matuszek
Natural Language Language Grounding Speech Input Language Acquisition Perceptual Information Self Supervised Speech Representation Model
