Partial Execution

Partial execution is the strategy of performing a computation on only a portion of the data or of a model, rather than running it in full, and it is gaining traction in several fields. Current research applies it to large language model (LLM) serving, where executing tools concurrently with decoding reduces latency, and to deep learning on resource-constrained devices, where it improves memory efficiency and enables on-device inference. The technique also shows promise for mitigating backdoor attacks in neural networks and for making Bayesian optimization more efficient by evaluating function networks selectively.

Papers