Training Neural Networks
Overview
- One time setup
  - Activation functions
  - Data pre-processing
  - Weight initialization
  - Regularization
- Training dynamics
  - Learning rate schedules
  - Large-batch training
  - Hyperparameter optimization
- After training
  - Model ensembles
  - Transfer learning
Activation Functions
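A minimal NumPy sketch of the activation functions commonly covered here (the function names are standard terminology, not tied to any particular framework):

```python
import numpy as np

def relu(x):
    # max(0, x): the default choice for most networks
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope avoids "dead" units
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    # Squashes to (0, 1); saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered, squashes to (-1, 1)
    return np.tanh(x)
```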
Data Preprocessing
- Transform input data (e.g. zero-center and normalize) so that training is efficient
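A minimal sketch of the usual zero-center-and-normalize transform (the `preprocess` name is illustrative; the key point is that statistics come from the training set only):

```python
import numpy as np

def preprocess(X_train, X_test):
    # Zero-center and scale to unit variance per feature, using
    # training-set statistics; the same transform is then applied
    # to test data.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8  # avoid division by zero
    return (X_train - mean) / std, (X_test - mean) / std
```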
Weight Initialization
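One widely used scheme for ReLU networks is Kaiming/He initialization; a sketch (function name is illustrative):

```python
import numpy as np

def kaiming_init(fan_in, fan_out, rng=None):
    # Kaiming/He initialization for ReLU layers: zero-mean Gaussian
    # with std = sqrt(2 / fan_in), which keeps activation variance
    # roughly constant from layer to layer.
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```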
Regularization
Common:
- Dropout
  - Consider if there are large fully-connected layers
- Batch Normalization
  - Almost always a good idea
- Data Augmentation
  - Cutout: randomly zero out square regions of the input image
  - Mixup: train on random blends of pairs of images and their labels
Try cutout and mixup for small classification datasets
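A NumPy sketch of both augmentations (function names and defaults are illustrative; real pipelines usually apply these per minibatch):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Mixup: train on a random convex combination of two examples
    # and of their one-hot labels, with weight lam ~ Beta(alpha, alpha).
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def cutout(img, size=8, rng=None):
    # Cutout: zero out a randomly placed square patch of the image.
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    out = img.copy()
    out[max(0, cy - size // 2):cy + size // 2,
        max(0, cx - size // 2):cx + size // 2] = 0
    return out
```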
Learning Rate
SGD, SGD + Momentum, Adagrad, RMSProp, and Adam all take the learning rate as a hyperparameter
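Two common schedules for that learning rate, sketched as simple functions (names and defaults are illustrative):

```python
import math

def step_decay(lr0, epoch, drop=0.1, every=30):
    # Step schedule: multiply the initial rate by `drop` every
    # `every` epochs (e.g. the classic ResNet recipe).
    return lr0 * (drop ** (epoch // every))

def cosine_schedule(lr0, epoch, total_epochs):
    # Cosine annealing: smoothly decay from lr0 down to 0
    # over the full training run.
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total_epochs))
```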
Early Stopping
How long to train? Train until accuracy on a held-out validation set stops improving, and keep the checkpoint with the best validation accuracy.
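A minimal sketch of that stopping rule (the class and method names are illustrative, not from any specific library):

```python
class EarlyStopping:
    # Stop training when validation accuracy has not improved for
    # `patience` consecutive checks; track the best value seen so
    # the corresponding checkpoint can be kept.
    def __init__(self, patience=5):
        self.patience = patience
        self.best = -float("inf")
        self.bad_checks = 0

    def step(self, val_acc):
        # Call once per validation check; returns True when
        # training should stop.
        if val_acc > self.best:
            self.best = val_acc
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience
```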