Paper ID: 2408.11029

Scaling Law with Learning Rate Annealing

Howe Tissue, Venus Wang, Lu Wang

We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps ($s$): $$L(s) = L_0 + A\cdot S_1^{-\alpha} - C\cdot S_2$$ Where $S_1$ is forward area and $S_2$ is learning rate annealing area. This formulation takes into account two factors: (1) The forward scaling defined as typical scaling law, and (2) the additional loss drop brought by LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than the single loss point at the end of training. Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss of language model training at any given step and across any learning rate scheduler (LRS). Furthermore, this equation accurately describes the dynamics during training process, and provides a theoretical verification and explanation for numerous experimental findings of previous studies, particularly those focusing on LR schedule and LR annealing. The resulting insights, also serve as a guide for researchers to select critical LRS in advance by prediction using our equation. Most significantly, since all the points in a full training curve follow the equation, we can achieve accurate loss prediction at any given step across any learning rate scheduler, while expending less than 1\% of the computational cost required by the chinchilla scaling law to fit language modeling loss. This approach extremely democratizes scaling law fitting and predicting in developing large language models.

Submitted: Aug 20, 2024