Transformer Inference
Research on transformer inference focuses on optimizing the execution of transformer-based models, aiming to reduce latency, memory usage, and power consumption while preserving accuracy. Current work emphasizes efficient implementations on specialized hardware such as FPGAs, employing techniques including model compression (pruning, quantization), algorithmic optimizations (parallel decoding, dynamic pruning), and novel architectures (linear-cost transformers). These advances are crucial for deploying large-scale transformer models in resource-constrained environments and real-time applications, with impact across natural language processing, computer vision, and scientific computing.
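To make one of these compression techniques concrete, the sketch below applies post-training dynamic quantization to a transformer's linear layers using PyTorch. This is a minimal illustration under stated assumptions, not the method of any particular paper; the model choice (distilbert-base-uncased) and the Hugging Face transformers calls are assumptions picked for brevity.

```python
# Minimal sketch: post-training dynamic quantization of a transformer
# for CPU inference. Assumes PyTorch and the `transformers` library;
# the model name below is an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "distilbert-base-uncased"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Convert nn.Linear weights to INT8; activations are quantized
# dynamically at runtime. This shrinks memory for the quantized
# layers and often reduces CPU inference latency.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Transformer inference example", return_tensors="pt")
with torch.no_grad():
    outputs = quantized(**inputs)
print(outputs.last_hidden_state.shape)
```

Because FP32 weights become INT8, the quantized layers use roughly a quarter of their original weight memory; the accuracy impact is task-dependent and should be validated before deployment.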