Model Latency

Model latency, the time a model takes to produce an output, is a critical factor limiting the deployment of increasingly complex machine learning models, particularly on resource-constrained devices. Current research focuses on reducing latency through techniques such as model pruning (removing less important parts of the model), more efficient architectures (e.g., binary neural networks, linearized models), and optimized inference processes (e.g., one-step diffusion models, dynamic model switching). Reducing latency is crucial for enabling real-time applications across diverse fields, from computer vision and speech processing to private inference and mobile AI, where responsiveness is essential to a positive user experience.
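Because latency claims depend heavily on how timing is done, a common practice is to discard warm-up runs and report percentiles rather than a single mean. The sketch below is an illustrative, framework-agnostic way to measure per-call latency in Python; the function and variable names (`benchmark_latency`, `dummy_model`) are hypothetical stand-ins, not from any specific paper or library.

```python
import time
import statistics

def benchmark_latency(model, inputs, warmup=5, runs=50):
    """Return p50/p95 latency in milliseconds for model(inputs)."""
    # Warm-up calls absorb one-time costs (allocation, caching, JIT).
    for _ in range(warmup):
        model(inputs)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model(inputs)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "p50_ms": statistics.median(timings),
        # Index into the sorted timings for an approximate 95th percentile.
        "p95_ms": timings[int(0.95 * (len(timings) - 1))],
    }

# Stand-in "model": a pure-Python dot product on a fixed-size vector.
def dummy_model(x):
    return sum(a * b for a, b in zip(x, x))

stats = benchmark_latency(dummy_model, list(range(1000)))
print(stats)
```

For real models the same pattern applies, but accelerator work is asynchronous, so a synchronization call must precede each timestamp or the measurement captures only kernel launch time.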

Papers