Stochastic Gradient Descent
Approximate the sum using a minibatch of examples; 32, 64, or 128 examples are common minibatch sizes
# Vanilla minibatch gradient descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights -= step_size * weights_grad
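A minimal runnable version of the loop above, sketched on a toy least-squares problem. The dataset and the bodies of sample_training_data and evaluate_gradient (with a simplified signature) are hypothetical stand-ins for illustration, not the actual course code.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                 # toy data
true_w = rng.normal(size=20)
y = X @ true_w + 0.1 * rng.normal(size=10_000)    # noisy linear targets

def sample_training_data(X, y, batch_size):
    idx = rng.integers(0, X.shape[0], size=batch_size)
    return X[idx], y[idx]

def evaluate_gradient(X_batch, y_batch, w):
    # gradient of mean squared error on the minibatch
    return 2.0 * X_batch.T @ (X_batch @ w - y_batch) / X_batch.shape[0]

weights = np.zeros(20)
step_size = 1e-2
for t in range(1_000):                            # finite loop instead of while True
    X_batch, y_batch = sample_training_data(X, y, 256)   # sample 256 examples
    weights_grad = evaluate_gradient(X_batch, y_batch, weights)
    weights -= step_size * weights_grad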
Problems with SGD
What if the loss function has a high condition number (the ratio of the largest to the smallest curvature of the loss is large)?
- A large step size overshoots and oscillates along the steep direction
- A small step size is stable but converges too slowly along the shallow direction
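A small toy illustration of this trade-off (my own example, not from the notes): a quadratic loss whose curvature along w[0] is 100x larger than along w[1], so the condition number is about 100.
import numpy as np

def grad(w):
    # loss = 50*w[0]**2 + 0.5*w[1]**2
    return np.array([100.0 * w[0], 1.0 * w[1]])

for step_size in (0.021, 0.002):
    w = np.array([1.0, 1.0])
    for t in range(200):
        w -= step_size * grad(w)
    print(step_size, w)
# step_size = 0.021: overshoots along the steep w[0] direction and blows up
# step_size = 0.002: stable, but the shallow w[1] direction is still far from 0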
What if the loss function has a local minimum or a saddle point?
The gradient is zero at both, so gradient descent can get stuck
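A quick sketch of the saddle-point case (a toy example, assumed for illustration): f(w) = w[0]**2 - w[1]**2 has zero gradient at the origin, so gradient descent started exactly there never moves even though lower loss exists.
import numpy as np

def grad(w):
    # gradient of f(w) = w[0]**2 - w[1]**2
    return np.array([2.0 * w[0], -2.0 * w[1]])

w = np.zeros(2)            # start exactly at the saddle point
for t in range(100):
    w -= 0.1 * grad(w)
print(w)                   # still [0., 0.] -- no progress was made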
Our gradients come from minibatches, so they can be noisy!
Solution
SGD + Momentum
- Build up velocity as a running mean of gradients
- Rho gives friction; typically rho = 0.9 or 0.99
v = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    v = rho * v + dw
    w -= learning_rate * v
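A hedged usage sketch of the update above, applied to the same badly conditioned toy quadratic from earlier (loss = 50*w[0]**2 + 0.5*w[1]**2), to compare plain SGD against SGD + momentum at the same learning rate.
import numpy as np

def compute_gradient(w):
    return np.array([100.0 * w[0], 1.0 * w[1]])

learning_rate, rho = 0.002, 0.9

# plain SGD for comparison
w = np.array([1.0, 1.0])
for t in range(200):
    w -= learning_rate * compute_gradient(w)
print("sgd     ", w)       # shallow direction w[1] is still far from 0

# SGD + momentum: velocity damps oscillation along the steep direction
# and keeps building speed along the shallow one
w, v = np.array([1.0, 1.0]), np.zeros(2)
for t in range(200):
    dw = compute_gradient(w)
    v = rho * v + dw
    w -= learning_rate * v
print("momentum", w)       # both coordinates are close to 0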
Benefits
- Momentum can carry us over local minima and past saddle points
- The running average of gradients smooths out minibatch noise
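A toy check of the smoothing effect (my own illustration): feed a constant true gradient plus zero-mean noise into the velocity update and observe that (1 - rho) * v settles near the true gradient, with the per-step noise substantially averaged down.
import numpy as np

rng = np.random.default_rng(0)
true_grad, rho, v = 1.0, 0.9, 0.0
for t in range(1_000):
    noisy_grad = true_grad + rng.normal(scale=1.0)   # noisy minibatch gradient
    v = rho * v + noisy_grad
print(v * (1 - rho))   # roughly true_grad = 1.0; much less noisy than a single gradient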