Paper ID: 2202.02115

Polyphonic pitch detection with convolutional recurrent neural networks

Carl Thomé, Sven Ahlbäck

Recent automatic speech recognition (ASR) research has shown that ASR benefits from deep learning models originally developed for image recognition challenges in computer vision. As automatic music transcription (AMT) is superficially similar to ASR, in the sense that methods often rely on transforming spectrograms into symbolic sequences of events (e.g. words or notes), deep learning should benefit AMT as well. In this work, we outline an online polyphonic pitch detection system that streams audio to MIDI using ConvLSTMs. Our system achieves state-of-the-art results on the 2007 MIREX multi-F0 development set, with an F-measure of 83% on the bassoon, clarinet, flute, horn and oboe ensemble recording, without requiring any musical language modelling or assumptions about instrument timbre.
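As an illustration of the F-measure quoted above, the following is a minimal sketch of a frame-level score for polyphonic pitch detection. It is not the authors' evaluation code and does not reproduce the exact MIREX multi-F0 protocol. It assumes that each frame is represented as a set of active MIDI note numbers and that counts are pooled over all frames; the function name `frame_f_measure` and the example data are hypothetical.

```python
from typing import Sequence, Set, Tuple


def frame_f_measure(
    reference: Sequence[Set[int]],
    estimate: Sequence[Set[int]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) pooled over all frames.

    Each element of `reference` and `estimate` is the set of MIDI note
    numbers considered active in that frame (an assumed representation).
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same number of frames")

    true_pos = false_pos = false_neg = 0
    for ref, est in zip(reference, estimate):
        true_pos += len(ref & est)   # pitches detected and actually present
        false_pos += len(est - ref)  # pitches detected but not present
        false_neg += len(ref - est)  # pitches present but missed

    detected = true_pos + false_pos
    present = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / present if present else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical three-frame example: a C major triad, a dyad, then silence.
reference_frames = [{60, 64, 67}, {60, 64}, set()]
estimated_frames = [{60, 64}, {60, 65}, {72}]
precision, recall, f_measure = frame_f_measure(reference_frames, estimated_frames)
```

Because the F-measure is the harmonic mean of precision and recall, a system cannot score well by over-detecting or under-detecting pitches alone; both error types lower the result.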

Submitted: Feb 4, 2022

Topics

Language Model
Automatic Speech Recognition
Music Transcription
Convolutional Recurrent Neural Network
Polyphonic Sound
