      Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

      Preprint

          Abstract

          Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size \(\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t\) early in training by using lower values for the learning rate \(\eta_t\). In this work we argue that warmup benefits training by keeping the overall size of \(\Delta \mathbf{w}_t\) limited, counteracting large initial values of \(\mathbf{u}_t\). Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: Why and by which criteria are early updates \(\mathbf{u}_t\) too large? We analyze different metrics for the update size including the \(\ell_2\)-norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize \(\mathbf{u}_t\) based on the aforementioned metrics.
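          As a rough illustration of the two ideas contrasted in the abstract (gradually ramping the learning rate versus explicitly limiting the size of each update), the PyTorch-style sketch below shows a linear warmup schedule and a simple per-tensor cap on the \(\ell_2\)-norm of the parameter change \(\Delta \mathbf{w}_t\). The names warmup_lr, capped_step, and max_update_norm are illustrative assumptions, not the paper's implementation; the authors instead modify the optimizer itself to normalize \(\mathbf{u}_t\) according to the metrics they analyze.

              import torch

              def warmup_lr(step, base_lr, warmup_steps):
                  # Linear warmup: eta_t ramps from ~0 up to base_lr over warmup_steps updates.
                  return base_lr * min(1.0, (step + 1) / warmup_steps)

              def capped_step(optimizer, params, max_update_norm):
                  # Illustrative alternative to warmup: take one optimizer step, then
                  # shrink the resulting parameter change Delta w_t = eta_t * u_t
                  # whenever its l2-norm exceeds max_update_norm.
                  before = [p.detach().clone() for p in params]
                  optimizer.step()
                  with torch.no_grad():
                      for p, p0 in zip(params, before):
                          delta = p - p0
                          norm = delta.norm()
                          if norm > max_update_norm:
                              p.copy_(p0 + delta * (max_update_norm / norm))

          In this sketch the cap plays the same role as warmup, keeping early updates small, but it acts on the realized update size rather than on the learning rate schedule.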

          Author and article information

          Journal
          31 October 2024
          Article
          arXiv:2410.23922

          http://arxiv.org/licenses/nonexclusive-distrib/1.0/

          History
          Custom metadata
          Accepted to NeurIPS 2024
          cs.LG

          Artificial intelligence
