Transformer Inference
Research on transformer inference focuses on optimizing the execution of transformer-based models, aiming to reduce latency, memory usage, and power consumption while preserving accuracy. Current work emphasizes efficient implementations on specialized hardware such as FPGAs, employing techniques including model compression (pruning, quantization), algorithmic optimizations (parallel decoding, dynamic pruning), and novel architectures (linear-cost transformers). These advances are crucial for deploying large-scale transformer models in resource-constrained environments and real-time applications, with impact on fields ranging from natural language processing and computer vision to scientific computing.
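To make one of these compression techniques concrete, the following is a minimal sketch of symmetric per-tensor int8 post-training weight quantization, the kind of step applied to transformer linear layers before hardware deployment. It is an illustrative example under simple assumptions, not code from any of the papers below; the function names quantize_int8 and dequantize are hypothetical.

import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor quantization: map float weights onto the
    # int8 range [-127, 127] using a single scale factor.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximate float tensor from the int8 values.
    return q.astype(np.float32) * scale

# Example: quantize a mock attention-projection matrix and check the
# reconstruction error, which is bounded by scale / 2 for this scheme.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(768, 768)).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())

In practice this cuts weight storage to a quarter of float32 and lets integer arithmetic units (e.g., FPGA DSP blocks) carry the matrix multiplies; per-channel scales and activation quantization are common refinements.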
Papers
CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference
Mohammad Erfan Sadeghi, Arash Fayyazi, Suhas Somashekar, Massoud Pedram
Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference
Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad