Model Checkpoint
Model checkpointing, the practice of saving intermediate model states during training, is crucial for large language models (LLMs) and other deep learning models: it enables fault tolerance, efficient hyperparameter optimization, and model reuse and merging. Current research focuses on improving checkpointing efficiency for various architectures, including Mixture-of-Experts (MoE) models, through techniques such as partial checkpointing, asynchronous saving, and compression. These advances are vital for reducing the substantial compute and storage costs of training and deploying increasingly large models, benefiting both research reproducibility and practical applications across fields.
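To make the asynchronous-saving idea concrete, the sketch below shows one common pattern (it is not taken from the papers listed here): the training thread snapshots model and optimizer state to host memory, and a background thread serializes it to disk so training resumes immediately. The function `async_save_checkpoint` and its signature are illustrative assumptions, not an API from any specific framework.

```python
import copy
import threading

import torch


def async_save_checkpoint(model, optimizer, step, path):
    """Hypothetical async save: snapshot state synchronously, write it in the background."""
    # The only blocking work on the training thread: copy tensors off the device.
    state = {
        "step": step,
        "model": {k: v.detach().cpu().clone() for k, v in model.state_dict().items()},
        "optimizer": copy.deepcopy(optimizer.state_dict()),
    }

    def _write():
        # Serialization and disk I/O run here while training continues.
        torch.save(state, path)

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer  # caller should join() before exiting or before overwriting `path`
```

The key design choice is that the copy-to-host step is the only part that stalls training; the slow serialization and filesystem write overlap with subsequent training steps.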
Papers
DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion
Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou