Choosing Hyperparameters

With tons of GPUs
Grid Search: choose several values for each hyperparameter

Choices are often spaced log-linearly
Example:

  • Weight decay: [1×10⁻⁴, 1×10⁻³, 1×10⁻², 1×10⁻¹]
  • Learning rate: [1×10⁻⁴, 1×10⁻³, 1×10⁻², 1×10⁻¹]

Evaluate every combination on this hyperparameter grid

This is expensive: the number of trials grows multiplicatively with the number of hyperparameters!
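The grid above can be sketched in a few lines; the values are the ones from the example, and the actual train/evaluate step is omitted:

```python
import itertools

# Example grid from above; every combination gets its own training run
weight_decays = [1e-4, 1e-3, 1e-2, 1e-1]
learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]

grid = list(itertools.product(weight_decays, learning_rates))
print(len(grid))  # 4 x 4 = 16 full training runs for just two hyperparameters
```

Adding a third hyperparameter with 4 values multiplies this to 64 runs, which is why grid search needs tons of GPUs.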

Random Search: choose a range for each hyperparameter and sample values from it

Often sample log-uniformly over the range
Example:

  • Weight decay: log-uniform on [1×10⁻⁴, 1×10⁻¹]
  • Learning rate: log-uniform on [1×10⁻⁴, 1×10⁻¹]

Run many different trials
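A minimal sketch of log-uniform sampling using only the standard library (the trial count of 16 is arbitrary, chosen to match the grid above):

```python
import math
import random

def log_uniform(low, high):
    """Sample log-uniformly from [low, high]."""
    return math.exp(random.uniform(math.log(low), math.log(high)))

random.seed(0)  # for reproducibility of this sketch
trials = [
    {"weight_decay": log_uniform(1e-4, 1e-1),
     "learning_rate": log_uniform(1e-4, 1e-1)}
    for _ in range(16)
]
```

Each trial draws a fresh value for every hyperparameter, so 16 trials give 16 distinct learning rates instead of only 4.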

Comparison

Random search draws more distinct samples for each hyperparameter individually, so it explores the important hyperparameters more thoroughly when some matter more than others
Pasted image 20241202170731.png

Choosing Hyperparameters

Without tons of GPUs

Step 1: Check initial loss

Turn off weight decay and sanity-check the loss at initialization (e.g. for a softmax classifier over C classes, expect loss ≈ log(C))
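For a softmax classifier the expected initial loss is easy to compute by hand: with weight decay off and roughly uniform predicted probabilities, cross-entropy should start near log(C):

```python
import math

def expected_initial_loss(num_classes):
    # At initialization, softmax outputs are roughly uniform,
    # so cross-entropy loss is about -log(1/C) = log(C)
    return math.log(num_classes)

print(expected_initial_loss(10))  # ~2.3 for a 10-class problem
```

If the measured initial loss is far from this, suspect a bug in the loss, the labels, or the initialization.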

Step 2: Overfit a small sample

Try to train to 100% training accuracy on a small sample of training data (a few minibatches); if the loss won't go down, adjust the architecture, learning rate, or initialization
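A toy stand-in for this step, using pure-Python logistic regression on a tiny linearly separable sample (the real check would use your actual model and a few minibatches of real data):

```python
import math

# Tiny, linearly separable 1-D sample: label is 1 iff x > 0
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

w, b, lr = 0.0, 0.0, 1.0
for _ in range(500):  # plenty of steps to overfit 6 points
    gw = gb = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += (p - y)
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

acc = sum((1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5) == (y == 1)
          for x, y in data) / len(data)
print(acc)  # 1.0: the model memorizes the small sample
```

If even a tiny sample can't be fit perfectly, the model or training loop has a bug.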

Step 3: Find Learning Rate that makes loss go down

Use the architecture from the previous step, now on the full training set; find a learning rate that makes the loss drop significantly within the first ~100 iterations
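A sketch of the sweep, using a stand-in quadratic loss so it runs anywhere; in practice each candidate would be a short run of your real model:

```python
def loss_after_steps(lr, steps=10, w=1.0):
    # Stand-in problem: minimize loss(w) = w**2 by gradient descent
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w**2 is 2*w
    return w * w

candidates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
good = [lr for lr in candidates if loss_after_steps(lr) < 1.0]
print(good)  # lr = 1.0 just oscillates and never reduces the loss
```

Keep only the learning rates where the loss actually goes down, and carry them into the next step.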

Step 4: Coarse grid, train for ~1-5 epochs

Choose a few values of learning rate and weight decay around what worked in Step 3; train a model for each combination for ~1-5 epochs
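Sketched as code (the center values here are hypothetical; bracket whatever learning rate worked in Step 3):

```python
import itertools

# Hypothetical: suppose lr = 1e-2 made the loss drop in Step 3.
# Bracket it, and reintroduce a few weight decay values.
learning_rates = [3e-3, 1e-2, 3e-2]
weight_decays = [0.0, 1e-4, 1e-3]

coarse_grid = list(itertools.product(learning_rates, weight_decays))
print(len(coarse_grid))  # 9 short runs of ~1-5 epochs each
```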

Step 5: Refine grid, train longer

Pick the best models from Step 4 and train them longer

Step 6: Look at learning curves

Loss and accuracy curves
Pasted image 20241202180705.png

Step 7: Go to step 5

Scenarios

Things are going well, just train longer
Pasted image 20241202181134.png
Overfitting
Pasted image 20241202181203.png
Underfitting
Pasted image 20241202181225.png
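The three scenarios above can be condensed into a simple rule of thumb; the thresholds here are hypothetical and task-dependent:

```python
def diagnose(train_acc, val_acc, gap=0.10, floor=0.70):
    # Hypothetical thresholds: a large train/val gap suggests overfitting;
    # low accuracy on both suggests underfitting
    if train_acc - val_acc > gap:
        return "overfitting: regularize more or get more data"
    if train_acc < floor:
        return "underfitting: train longer or use a bigger model"
    return "healthy: just train longer"

print(diagnose(0.95, 0.70))  # large gap -> overfitting
```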