Gradient Accumulation

Gradient accumulation (GA) is a technique for training large deep learning models on limited GPU memory: gradients from several smaller micro-batches are accumulated and the weights are updated only once per accumulation window, so the effective batch size grows without a corresponding increase in memory use. Current research applies GA across a range of architectures, including vision transformers (such as Swin Transformers) and transformer-based models for natural language processing, as well as within specific optimization algorithms such as Adam. While GA can improve training efficiency for some models, its effectiveness varies with architecture and task, and some studies report performance degradation or longer training times. The overall goal is to enable training of larger and more complex models without being constrained by hardware limitations, benefiting fields such as medical image analysis and histopathology, where datasets and images are large.
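
The loop below is a minimal sketch of the accumulate-then-update pattern described above, assuming a PyTorch-style training loop; the model, optimizer, data, and the accumulation_steps setting are illustrative placeholders rather than the setup of any particular paper.

```python
import torch
import torch.nn as nn

# Placeholder model, optimizer, and data for illustration only.
model = nn.Linear(128, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
accumulation_steps = 4  # effective batch = 4 x micro-batch size

# Toy "data loader": 16 micro-batches of 8 samples each.
data_loader = [(torch.randn(8, 128), torch.randint(0, 10, (8,)))
               for _ in range(16)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data_loader):
    outputs = model(inputs)
    # Scale the loss so the accumulated gradient approximates the gradient
    # of one large batch averaged over all accumulated samples.
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()  # gradients add up in each parameter's .grad buffer

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # single weight update per accumulation window
        optimizer.zero_grad()  # reset gradients for the next window
```

Only one micro-batch of activations is resident in memory at a time, which is what lets the effective batch size exceed what the GPU could hold in a single forward/backward pass.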

Papers