Token and Duration Transducer

Token-and-Duration Transducers (TDTs) are a sequence-to-sequence model architecture designed to improve the speed and accuracy of tasks such as speech recognition and translation. A TDT jointly predicts each token and its duration, that is, the number of input frames the token covers, so the decoder can skip ahead by that many frames instead of evaluating every frame as a traditional transducer does. Research focuses on optimizing TDT inference through decoding algorithms such as label-looping, which prioritize processing labels over frame-by-frame analysis and yield significant speedups. Together, these techniques improve efficiency and accuracy across a range of applications compared with traditional transducer models.
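The frame-skipping idea can be illustrated with a minimal sketch of plain greedy decoding. It is a simplified illustration, not the label-looping algorithm and not any library's implementation. The names `tdt_greedy_decode`, `toy_joint`, `BLANK`, and `max_symbols_per_frame` are hypothetical, and `toy_joint` is a scripted stand-in for the encoder, predictor, and joint network, which in a real model would produce the token and duration predictions.

```python
from typing import Callable, List, Tuple

BLANK = 0  # hypothetical id for the blank token


def tdt_greedy_decode(
    num_frames: int,
    joint: Callable[[int, List[int]], Tuple[int, int]],
    max_symbols_per_frame: int = 10,
) -> Tuple[List[int], int]:
    """Greedy decoding sketch: returns (tokens, number of joint calls)."""
    t = 0                 # current input frame
    tokens: List[int] = []
    calls = 0             # how many times the joint was evaluated
    symbols_at_t = 0      # tokens emitted without advancing the frame
    while t < num_frames:
        token, duration = joint(t, tokens)
        calls += 1
        if token != BLANK:
            tokens.append(token)
        if duration == 0:
            symbols_at_t += 1
            # Assumption: force progress when a blank predicts duration 0
            # or too many tokens pile up on one frame, to avoid looping forever.
            if token == BLANK or symbols_at_t >= max_symbols_per_frame:
                duration = 1
        if duration > 0:
            t += duration  # skip ahead by the predicted duration
            symbols_at_t = 0
    return tokens, calls


def toy_joint(t: int, history: List[int]) -> Tuple[int, int]:
    """Scripted (token, duration) predictions keyed by frame index."""
    script = {0: (5, 3), 3: (BLANK, 2), 5: (7, 4), 9: (BLANK, 3)}
    return script.get(t, (BLANK, 1))


tokens, calls = tdt_greedy_decode(12, toy_joint)
print(tokens, calls)  # [5, 7] 4
```

In this toy run the decoder covers 12 frames with 4 joint evaluations, whereas a decoder that advances one frame at a time would need at least 12. The numbers come from the scripted predictions above, not from a measured model.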

Papers