Convolution Augmented Transformer

Convolution-augmented transformers (CATs), also known as Conformers, combine the global context modeling of transformers with the local feature extraction of convolutional neural networks. Current research applies CATs across modalities, including speech processing (recognition, enhancement, translation), image processing (segmentation, super-resolution), and music information retrieval (cover song identification). In these models, self-attention captures long-range dependencies across the whole input while convolution captures fine-grained local patterns, and pairing the two is intended to yield models that are more robust and efficient than either component alone.
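To make the pairing of global attention and local convolution concrete, here is a minimal sketch in plain Python. It is a deliberately simplified illustration, not the architecture of any particular paper: the attention uses identity projections instead of learned query/key/value weights, and the feed-forward modules, normalization, and gating found in published Conformer blocks are omitted. All function names (`self_attention`, `depthwise_conv1d`, `cat_block`) are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention over x (list of T vectors of size D).

    Assumption: identity projections stand in for learned Q/K/V weights.
    Every output step is a weighted mix of ALL input steps (global context).
    """
    dim = len(x[0])
    out = []
    for query in x:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in x
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * value[c] for w, value in zip(weights, x)) for c in range(dim)]
        )
    return out


def depthwise_conv1d(x, kernels):
    """Depthwise 1D convolution along time with zero 'same' padding.

    kernels[c] is an odd-length filter applied only to channel c, so each
    output step depends on a small neighbourhood of steps (local context).
    """
    steps, dim = len(x), len(x[0])
    out = [[0.0] * dim for _ in range(steps)]
    for c in range(dim):
        half = len(kernels[c]) // 2
        for t in range(steps):
            acc = 0.0
            for j, weight in enumerate(kernels[c]):
                src = t + j - half
                if 0 <= src < steps:  # positions outside the sequence count as zero
                    acc += weight * x[src][c]
            out[t][c] = acc
    return out


def add(a, b):
    """Element-wise sum of two T x D sequences (used for residual connections)."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def cat_block(x, kernels):
    """Simplified block: attention with residual, then convolution with residual."""
    attended = add(x, self_attention(x))
    return add(attended, depthwise_conv1d(attended, kernels))


# Usage: a sequence of 3 time steps with 2 channels and a smoothing filter per channel.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
smoothing = [[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]]
output = cat_block(sequence, smoothing)
```

The block keeps the input shape (T steps by D channels), so several such blocks can be stacked. Running attention before convolution mirrors the ordering commonly described for Conformer-style blocks, but the order, kernel size, and residual layout here are illustrative choices.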

Papers