Unraveling Transformer Optimization: A Hessian-Based Explanation for Adam’s Superiority over SGD
Marktechpost
SEPTEMBER 30, 2024
While the Adam optimizer has become the standard for training Transformers, stochastic gradient descent with momentum (SGD), which is highly effective for convolutional neural networks (CNNs), performs noticeably worse on Transformer models. A central open question in this domain is why optimizer performance varies so much across architectures.
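For readers less familiar with the two optimizers being compared, the sketch below (assuming PyTorch) shows how SGD with momentum and Adam would typically be instantiated for a small Transformer encoder; the model size, learning rates, and momentum/beta values are illustrative defaults, not hyperparameters taken from the paper.

```python
# Minimal sketch (assumes PyTorch): the two optimizers compared in the
# article, applied to a small Transformer encoder. All hyperparameters
# here are illustrative, not from the paper under discussion.
import torch
import torch.nn as nn

# A small Transformer encoder stack as the model to be optimized.
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
model = nn.TransformerEncoder(encoder_layer, num_layers=4)

# SGD with momentum: effective for CNNs but, per the article,
# tends to underperform on Transformers.
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Adam: the de-facto standard for Transformer training.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Either optimizer plugs into the usual training step:
x = torch.randn(8, 32, 256)          # (batch, sequence, d_model)
loss = model(x).pow(2).mean()        # dummy objective for illustration
loss.backward()
adam.step()                          # or sgd.step()
adam.zero_grad()
```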