Learning Rate

Stochastic Gradient Descent, AdaGrad, and Adam all have the learning rate as a hyperparameter

![[Pasted image 20241201002807.png]]

Which one of these learning rates is best to use?

All of them! Start with a large learning rate and decay it over time

Learning Rate Decay

![[Pasted image 20241201003148.png]]
Step decay: reduce the learning rate at a few fixed points during training, e.g. multiply it by a constant factor after certain epochs.
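
A minimal sketch of step decay in plain Python (the milestone epochs and decay factor 0.1 are illustrative, not values from these notes):

```python
def step_decay_lr(base_lr, epoch, milestones=(30, 60, 90), gamma=0.1):
    """Learning rate for a given epoch under step decay:
    multiply by `gamma` each time a milestone epoch is passed."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Example: base_lr=0.1 gives 0.1 for epochs 0-29, 0.01 for 30-59,
# 0.001 for 60-89, and 0.0001 afterwards.
```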

Linear warmup

High initial learning rates can make the loss explode

Linearly increasing the learning rate from 0 over the first ~5,000 iterations can prevent this
![[Pasted image 20241201003120.png]]
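
A minimal sketch of linear warmup in plain Python (the 5,000-iteration warmup length matches the note above; the base learning rate is illustrative, and the post-warmup schedule is left as a plain constant):

```python
def warmup_lr(base_lr, it, warmup_iters=5000):
    """Learning rate at iteration `it`: ramp linearly from ~0 to base_lr
    over the first `warmup_iters` iterations, then return base_lr."""
    if it < warmup_iters:
        return base_lr * (it + 1) / warmup_iters
    return base_lr  # hand off to the regular (e.g. decayed) schedule here

# Example: base_lr=1e-3 grows from 2e-7 at iteration 0 to 1e-3 by iteration 5000.
```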