Unraveling Transformer Optimization: A Hessian-Based Explanation for Adam’s Superiority over SGD
Marktechpost
SEPTEMBER 30, 2024
Large Language Models (LLMs) based on Transformer architectures have revolutionized AI development. While the Adam optimizer has become the standard for training Transformers, stochastic gradient descent with momentum (SGD), which is highly effective for convolutional neural networks (CNNs), performs worse on Transformer models.
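To make the comparison concrete, here is a minimal sketch (not the article's experiments) of the two update rules in question, SGD with momentum and Adam, applied to a toy one-dimensional quadratic loss; all hyperparameter values below are illustrative defaults, not values from the article.

```python
# Toy loss f(w) = 0.5 * w**2, whose gradient is simply w.
# Hyperparameter names (lr, momentum, beta1, beta2, eps) follow common convention.

def sgd_momentum_step(w, v, grad, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update: v accumulates gradients, w moves along -v."""
    v = momentum * v + grad
    w = w - lr * v
    return w, v

def adam_step(w, m, v, t, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: first/second gradient moments with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# Run both optimizers from the same starting point w = 5.0.
w_sgd, v_sgd = 5.0, 0.0
w_adam, m_adam, v_adam = 5.0, 0.0, 0.0
for t in range(1, 101):
    w_sgd, v_sgd = sgd_momentum_step(w_sgd, v_sgd, grad=w_sgd)
    w_adam, m_adam, v_adam = adam_step(w_adam, m_adam, v_adam, t, grad=w_adam)
print(f"SGD+momentum: w = {w_sgd:.4f}, Adam: w = {w_adam:.4f}")
```

On a well-conditioned quadratic like this both methods converge; the article's point is that the gap between them emerges on Transformer loss landscapes, whose Hessian structure differs from that of CNNs.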