Paper ID: 2502.14837 • Published Feb 20, 2025
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui
Fudan University • East China Normal University • Hikvision Inc • Shanghai AI Lab
Multi-head Latent Attention (MLA) is an innovative architecture proposed by
DeepSeek, designed to ensure efficient and economical inference by
significantly compressing the Key-Value (KV) cache into a latent vector.
Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its
variants such as Grouped-Query Attention (GQA) exhibit significant cost
disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA
without pre-training from scratch is both meaningful and challenging. This
paper proposes the first data-efficient fine-tuning method for transitioning
from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE,
we remove RoPE from the query and key dimensions that contribute less to the
attention scores; for low-rank approximation, we introduce a joint SVD
approximation based on the pre-trained key and value projection parameters. These
carefully designed strategies enable MHA2MLA to recover performance using only
a small fraction (0.3% to 0.6%) of the data, significantly reducing inference
costs while seamlessly integrating with compression techniques such as KV cache
quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%,
with only a 0.5% drop in LongBench performance.
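The low-rank component can be pictured with a short sketch. The PyTorch snippet below is a minimal illustration, not the authors' released code: the function name joint_kv_svd, the weight shapes, and the way singular values are absorbed into the down-projection are all assumptions. It shows how a truncated SVD of the concatenated pre-trained key/value projection weights yields one shared down-projection (whose output can serve as the compressed latent KV cache) plus separate up-projections that reconstruct keys and values; the RoPE-carrying dimensions handled by partial-RoPE are omitted here.

```python
import torch


def joint_kv_svd(W_k: torch.Tensor, W_v: torch.Tensor, rank: int):
    """Illustrative joint low-rank factorization of pre-trained key/value
    projections via SVD (a sketch of the idea, not the paper's exact recipe).

    W_k, W_v: (d_model, d_kv) projection matrices from a trained MHA layer.
    rank:     target latent width (the compressed KV-cache dimension).

    Returns:
      W_down: (d_model, rank)  projects hidden states to the shared latent
      W_up_k: (rank, d_kv)     reconstructs keys from the latent
      W_up_v: (rank, d_kv)     reconstructs values from the latent
    """
    # Factorize keys and values jointly so they share a single latent space.
    W_kv = torch.cat([W_k, W_v], dim=1)               # (d_model, 2 * d_kv)
    U, S, Vh = torch.linalg.svd(W_kv, full_matrices=False)

    # Keep only the top-`rank` singular directions.
    U_r = U[:, :rank]                                  # (d_model, rank)
    S_r = S[:rank]                                     # (rank,)
    Vh_r = Vh[:rank, :]                                # (rank, 2 * d_kv)

    # Absorb singular values into the down-projection (one of several
    # equally valid choices), then split the up-projection back into
    # key and value halves.
    W_down = U_r * S_r
    W_up_k, W_up_v = Vh_r.split(W_k.shape[1], dim=1)
    return W_down, W_up_k, W_up_v
```

Under this sketch, only the rank-dimensional latent `h @ W_down` would be cached per token at inference time, with keys and values reconstructed on the fly through `W_up_k` and `W_up_v`; the paper's fine-tuning stage then recovers any accuracy lost to the truncation.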