Audio Token
Audio tokens represent audio signals as sequences of discrete units, analogous to words in text, so that language modeling techniques can be applied to audio processing. Current research focuses on making these representations more consistent and efficient, using architectures such as transformers and techniques such as masking and byte-pair encoding to improve model training and inference. Because audio and text can then be handled as the same kind of discrete sequence, the approach is promising for applications including text-to-speech synthesis, speech enhancement, and music generation. Benchmarking efforts are underway to compare tokenization methods and identify suitable configurations for different tasks.
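To make the idea concrete, the following is a minimal sketch in Python, not the method of any paper listed here. It assumes a tiny hand-written codebook in place of a learned neural codec, maps each audio frame to the index of its nearest codebook vector, and then applies a single byte-pair-encoding merge to the resulting token sequence. The function names (quantize, most_frequent_pair, merge_pair) and all numeric values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def quantize(frames, codebook):
    """Map each frame (a list of floats) to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def most_frequent_pair(seq):
    """Return the most common adjacent pair of token ids, or None if there is no pair."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (scanning left to right) with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


# Hypothetical 2-dimensional codebook with three entries (ids 0, 1, 2).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Hypothetical feature frames, standing in for encoder outputs of an audio codec.
frames = [[0.1, 0.0], [0.9, 1.1], [0.1, 0.0], [0.9, 1.0], [0.0, 0.9]]

tokens = quantize(frames, codebook)              # discrete audio tokens
pair = most_frequent_pair(tokens)                # most frequent adjacent pair
merged = merge_pair(tokens, pair, len(codebook)) # one BPE merge, new id = 3

print(tokens, pair, merged)
```

In this toy setup the merge shortens the sequence by replacing a recurring pair of tokens with a single new id, which is the sense in which byte-pair encoding can reduce sequence length for a downstream language model. Real systems learn the codebook from data and apply many merges.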
Papers
Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition
Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg
FoleyGen: Visually-Guided Audio Generation
Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra