Long Form

Long-form speech recognition aims to accurately transcribe extended audio recordings and must cope with the challenges that the length and complexity of such recordings pose. Current research focuses on improving existing architectures such as Conformers and neural transducers, often by integrating large language models (LLMs) or adding memory augmentation to handle long-range dependencies and reduce errors. These advances matter for the accuracy and efficiency of speech-to-text systems in applications such as transcribing lectures, meetings, and other extended audio content. Research is also exploring ways to mitigate long-form deletion errors and mismatch between training and test data.

Papers

June 24, 2024

Exploring the Capability of Mamba in Speech Applications
Koichi Miyazaki, Yoshiki Masuyama, Masato Murata
Automatic Speech Recognition Transformer Based Model Speech Synthesis Capability Evolution Mamba in Mamba Speech Application Long Form

June 5, 2024

A Frame-based Attention Interpretation Method for Relevant Acoustic Feature Extraction in Long Speech Depression Detection
Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia
Feature Extraction Depression Detection Audio Spectrogram Transformer Frame Attention Long Form Speech Based Depression Detection

May 23, 2024

Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition
Chan-Jan Hsu, Yi-Chang Chen, Feng-Ting Liao, Pei-Chen Ho, Yu-Hsiang Wang, Po-Chun Hsu, Da-shan Shiu
Large Language Model Practical Algorithm Text Recognition Chinese Text Recognition Long Form

March 20, 2024

Advanced Long-Content Speech Recognition With Factorized Neural Transducer
Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian
Long Form

December 18, 2023

Improved Long-Form Speech Recognition by Jointly Modeling the Primary and Non-primary Speakers
Guru Prakash Arumugam, Shuo-yiin Chang, Tara N. Sainath, Rohit Prabhavalkar, Quan Wang, Shaan Bijwadia
Importance Aware Automatic Speech Recognition Model Speech Driven Long Form Long Form Deletion Mismatch Classification

September 26, 2023

Updated Corpora and Benchmarks for Long-Form Speech Recognition
Jennifer Drexler Fox, Desh Raj, Natalie Delworth, Quinn McNamara, Corey Miller, Migüel Jetté
New Benchmark Large Corpus Attention Based Encoder Decoder Long Form Speech Recognition Corpus

September 22, 2023

Memory-augmented conformer for improved end-to-end long-form ASR
Carlos Carvalho, Alberto Abad
Automatic Speech Recognition Attention Based Model Memory Augmented Long Form

September 15, 2023

Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition
Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, Hermann Ney
Speech Recognition Attention Based Encoder Decoder Streaming Model Long Form

June 28, 2023

Accelerating Transducers through Adjacent Token Merging
Yuang Li, Yu Wu, Jinyu Li, Shujie Liu
Automatic Speech Recognition Token Merging Sequence Transducer Long Form

June 13, 2023

Large-scale Language Model Rescoring on Long-form Data
Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno, Michael Riley
Language Model Automatic Speech Recognition Large Scale Language Model Long Form Language Model Rescoring Long Text Data

May 28, 2023

Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR
W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-yiin Chang, Tara N. Sainath
Language Model Semantic Segmentation Long Form Sentence Boundary

May 24, 2023

RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models
David Qiu, David Rim, Shaojin Ding, Oleg Rybakov, Yanzhang He
Automatic Speech Recognition Model Compression Long Form Quantization Noise Quantization Scale

May 18, 2023

FunASR: A Fundamental End-to-End Speech Recognition Toolkit
Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, Shiliang Zhang
Speech Recognition End to End Speech Processing Long Form Non Autoregressive End to End

December 5, 2022

LMEC: Learnable Multiplicative Absolute Position Embedding Based Conformer for Speech Recognition
Yuguang Yang, Yu Pan, Jingjing Yin, Heng Lu
Speech Recognition Linear Attention One Pass Multiple Conformer Long Form Absolute Position

November 17, 2022

LongFNT: Long-form Speech Recognition with Factorized Neural Transducer
Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian
Neural Transducer Long Form Context Token

April 22, 2022

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu
Speech Recognition End to End Voice Activity Detection Joint Segmentation Long Form

February 22, 2022

VADOI: Voice-Activity-Detection Overlapping Inference For End-to-end Long-form Speech Recognition
Jinhan Wang, Xiaosu Tong, Jinxi Guo, Di He, Roland Maas
Speech Recognition Long Form Speech Translation Corpus Overlapped Speech Detection