Speech Language Model
Speech language models (SLMs) aim to process and generate speech directly, bypassing the traditional intermediate steps of automatic speech recognition (ASR) and text-to-speech (TTS) synthesis. Current research focuses on improving SLM architectures, such as hierarchical transformers and encoder-decoder models, often incorporating techniques like self-supervised learning, knowledge distillation, and prompt engineering to enhance efficiency and performance on tasks including speech translation, speech synthesis, and spoken question answering. These advances hold significant potential for more natural and intuitive human-computer interaction, particularly in applications that require real-time speech processing and generation.
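As a rough illustration (not the method of either paper listed below), a common SLM recipe first quantizes audio into discrete "speech units" with a self-supervised encoder or neural codec, then models those units with an autoregressive transformer, and finally maps generated units back to a waveform with a vocoder. The minimal PyTorch sketch below assumes pre-computed unit ids and shows only the language-modeling half; all names and sizes are illustrative.

```python
# Minimal sketch of the "discrete speech units + autoregressive LM" idea.
# Assumes audio has already been quantized into unit ids by a self-supervised
# encoder or neural codec (not shown); a vocoder would turn generated units
# back into a waveform. Sizes and names are illustrative, not from any paper.
import torch
import torch.nn as nn

class SpeechUnitLM(nn.Module):
    def __init__(self, vocab_size=1024, d_model=256, n_heads=4, n_layers=4, max_len=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, units):
        # units: (batch, time) ids of quantized speech frames
        b, t = units.shape
        pos = torch.arange(t, device=units.device)
        x = self.token_emb(units) + self.pos_emb(pos)
        # Causal mask so each position attends only to past speech units.
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=units.device), diagonal=1)
        h = self.backbone(x, mask=causal)
        return self.lm_head(h)  # logits over the next speech unit

# Usage: greedily predict one continuation unit for a (random stand-in) prompt.
model = SpeechUnitLM()
prompt = torch.randint(0, 1024, (1, 50))   # stand-in for encoded prompt speech
next_unit = model(prompt)[:, -1].argmax(-1)
print(next_unit)                            # would be fed to a vocoder in practice
```

In practice the choice of unit vocabulary, speech encoder, and vocoder varies widely across systems; some models, such as the spectrogram-based approach below, operate on continuous acoustic features rather than discrete units.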
Papers
Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM
Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, Michelle Tadmor Ramanovich
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng, Xuedong Huang