Model Latency

Model latency, the time a model takes to produce an output, is a critical factor limiting the deployment of increasingly complex machine learning models, particularly on resource-constrained devices. Current research focuses on reducing latency through techniques such as model pruning (removing less important parts of the model), more efficient architectures (e.g., binary neural networks, linearized models), and optimized inference processes (e.g., one-step diffusion models, dynamic model switching). Reducing latency is crucial for enabling real-time applications across diverse fields, from computer vision and speech processing to private inference and mobile AI, where responsiveness is essential to a positive user experience.
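Because latency claims depend heavily on how timing is done, a common practice is to discard warm-up runs and report percentiles rather than a single mean. The sketch below is an illustrative, framework-agnostic way to measure per-call latency in Python; the function and variable names (`benchmark_latency`, `dummy_model`) are hypothetical stand-ins, not from any specific paper or library.

```python
import time
import statistics

def benchmark_latency(model, inputs, warmup=5, runs=50):
    """Return p50/p95 latency in milliseconds for model(inputs)."""
    # Warm-up calls absorb one-time costs (allocation, caching, JIT).
    for _ in range(warmup):
        model(inputs)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model(inputs)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "p50_ms": statistics.median(timings),
        # Index into the sorted timings for an approximate 95th percentile.
        "p95_ms": timings[int(0.95 * (len(timings) - 1))],
    }

# Stand-in "model": a pure-Python dot product on a fixed-size vector.
def dummy_model(x):
    return sum(a * b for a, b in zip(x, x))

stats = benchmark_latency(dummy_model, list(range(1000)))
print(stats)
```

For real models the same pattern applies, but accelerator work is asynchronous, so a synchronization call must precede each timestamp or the measurement captures only kernel launch time.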

Papers