Inference Speed
Inference speed, the time a machine learning model takes to process an input and produce an output, is a critical factor limiting the deployment of powerful models in resource-constrained environments and real-time applications. Current research focuses on optimizing model architectures such as transformers and diffusion models through techniques like knowledge distillation, model pruning, parallel decoding, and early exiting, with the goal of substantially reducing latency without sacrificing accuracy. These advances are crucial for bringing large language models, computer vision systems, and other computationally intensive models to a wide range of platforms, from smartphones to embedded devices.
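To make one of these techniques concrete, below is a minimal sketch of early exiting in PyTorch: an intermediate classification head returns a prediction immediately when its softmax confidence clears a threshold, so the deeper layers are skipped for easy inputs. All names here (EarlyExitMLP, early_head, the 0.9 threshold) are illustrative assumptions for a toy model, not taken from any of the papers listed on this page.

```python
import torch
import torch.nn as nn


class EarlyExitMLP(nn.Module):
    """Toy classifier with one intermediate exit head (hypothetical example).

    At inference time, if the early head is confident enough, the remaining
    layers are skipped, trading a small accuracy risk for lower latency.
    """

    def __init__(self, in_dim=32, hidden=64, n_classes=10, threshold=0.9):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.early_head = nn.Linear(hidden, n_classes)  # cheap exit after block1
        self.final_head = nn.Linear(hidden, n_classes)  # full-depth prediction
        self.threshold = threshold  # tunable confidence cutoff (assumed value)

    @torch.no_grad()
    def forward(self, x):
        h = self.block1(x)
        early_probs = self.early_head(h).softmax(dim=-1)
        conf, pred = early_probs.max(dim=-1)
        # Exit early only if every sample in the batch is confident;
        # real systems typically route samples individually instead.
        if bool((conf >= self.threshold).all()):
            return pred  # skip block2 entirely
        return self.final_head(self.block2(h)).argmax(dim=-1)


# Demo of the control flow only: the model is untrained, so its confidence
# values (and hence which branch fires) are arbitrary here.
model = EarlyExitMLP().eval()
x = torch.randn(1, 32)
print(model(x))
```

The same confidence-gated pattern generalizes to transformer layers, where each exit head amortizes its extra parameters against the layers it lets the model skip.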
Papers
Efficient Motion Prediction: A Lightweight & Accurate Trajectory Prediction Model With Fast Training and Inference Speed
Alexander Prutsch, Horst Bischof, Horst Possegger
Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet
FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung
EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed
Ziyang Zhuang, Chenfeng Miao, Kun Zou, Ming Fang, Tao Wei, Zijian Li, Ning Cheng, Wei Hu, Shaojun Wang, Jing Xiao