Learning Rate
Stochastic Gradient Descent, AdaGrad, and Adam all have the learning rate as a hyperparameter.
Which one of these learning rates is best to use?
All of them! Start with a large learning rate and decay it over time.
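As a concrete illustration, here is a minimal PyTorch sketch (the linear model is just a placeholder) showing that each of these optimizers takes the learning rate as a constructor argument:

```python
import torch

# Placeholder model; any nn.Module's parameters work here.
model = torch.nn.Linear(10, 2)

# Every optimizer exposes the learning rate as the `lr` hyperparameter.
sgd     = torch.optim.SGD(model.parameters(), lr=1e-1)
adagrad = torch.optim.Adagrad(model.parameters(), lr=1e-2)
adam    = torch.optim.Adam(model.parameters(), lr=1e-3)
```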
Learning Rate Decay
Step decay: reduce the learning rate at a few fixed points.
- E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.
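A minimal sketch of this step schedule using PyTorch's built-in scheduler (the model and training loop are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step decay: multiply the LR by 0.1 after epochs 30, 60, and 90.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):
    # train_one_epoch(model, optimizer)   # placeholder training loop
    scheduler.step()                      # advance the schedule once per epoch
```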
Examples of decay schedules (see the sketch after this list):
- Cosine: two hyperparameters (initial LR and total training length); no new hyperparameters beyond what you already tune
- Linear
- Inverse sqrt: spends little time at high learning rates and a lot of time at low learning rates
- Constant: works well most of the time
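A minimal sketch of these schedules as plain Python functions, using the standard formulas (shown in the docstrings); the +1 in the inverse-sqrt denominator is an implementation choice to avoid dividing by zero at t = 0:

```python
import math

def cosine_lr(t, lr0, T):
    """Cosine decay: alpha_t = 0.5 * alpha_0 * (1 + cos(pi * t / T))."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * t / T))

def linear_lr(t, lr0, T):
    """Linear decay: alpha_t = alpha_0 * (1 - t / T)."""
    return lr0 * (1 - t / T)

def inv_sqrt_lr(t, lr0):
    """Inverse sqrt decay: alpha_t = alpha_0 / sqrt(t + 1)."""
    return lr0 / math.sqrt(t + 1)

def constant_lr(t, lr0):
    """Constant: alpha_t = alpha_0."""
    return lr0
```

Here t is the current epoch (or iteration), lr0 the initial learning rate, and T the total training length.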
Linear warmup
High initial learning rates can make the loss explode.
Linearly increasing the learning rate from 0 over the first ~5000 iterations can prevent this.
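A minimal sketch of linear warmup (the 5000-iteration warmup length is the ballpark figure from above; `base_lr` is whatever initial learning rate your decay schedule starts from):

```python
def warmup_lr(it, base_lr, warmup_iters=5000):
    """Linearly ramp the learning rate from 0 to base_lr over warmup_iters."""
    if it < warmup_iters:
        return base_lr * it / warmup_iters
    return base_lr  # after warmup, hand off to whatever decay schedule you use
```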