Stochastic Gradient Descent

Approximate the sum over the full training set using a minibatch of examples; batch sizes of 32 / 64 / 128 are common.

# vanilla minibatch gradient descent (pseudocode: the helper functions are not defined here)

while True:
	data_batch = sample_training_data(data, 256)  # sample 256 examples
	weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
	weights -= step_size * weights_grad  # perform parameter update
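
The helpers in the loop above are pseudocode. Below is a minimal runnable sketch of the same loop, assuming a synthetic least-squares problem (the data, loss, and hyperparameters are illustrative choices, not from the notes):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                 # 10k synthetic examples, 20 features
true_w = rng.normal(size=20)
y = X @ true_w + 0.1 * rng.normal(size=10_000)    # noisy linear targets

weights = np.zeros(20)
step_size = 0.1
for t in range(1_000):
	idx = rng.integers(0, len(X), size=256)       # sample a minibatch of 256 examples
	Xb, yb = X[idx], y[idx]
	# gradient of the mean squared error over the minibatch
	weights_grad = 2 * Xb.T @ (Xb @ weights - yb) / len(idx)
	weights -= step_size * weights_grad           # vanilla SGD update
print(np.linalg.norm(weights - true_w))           # small: weights approach true_w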
SGD
$x_{t+1} = x_t - \alpha \nabla f(x_t)$

Problems

Plain SGD struggles when the loss is poorly conditioned: it overshoots and zig-zags along steep directions while making slow progress along shallow ones (see the sketch below). It can also stall at local minima and saddle points where the gradient is zero, and minibatch gradients are noisy.
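
A toy demonstration of the conditioning problem, on a hypothetical two-dimensional quadratic (the curvatures and step size are chosen only to make the zig-zag visible):

import numpy as np

# f(x) = 0.5 * (1*x1^2 + 10*x2^2): shallow along x1, steep along x2
A = np.array([1.0, 10.0])
grad_f = lambda x: A * x

x = np.array([10.0, 1.0])
alpha = 0.18                       # big enough that the steep direction overshoots
for t in range(5):
	x = x - alpha * grad_f(x)
	print(x)                       # x2 flips sign each step (zig-zag); x1 shrinks slowly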

Solution

Build up a velocity as a running accumulation of gradients and step along it instead of the raw gradient: SGD + Momentum, below.

SGD + Momentum

$v_{t+1} = \rho v_t + \nabla f(x_t)$
$x_{t+1} = x_t - \alpha v_{t+1}$
v = 0
for t in range(num_steps):
	dw = compute_gradient(w)    # gradient of the loss at the current weights
	v = rho * v + dw            # velocity: running accumulation of gradients (rho is typically 0.9 or 0.99)
	w -= learning_rate * v      # step along the velocity, not the raw gradient
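
A self-contained, runnable version of the loop above on the same kind of ill-conditioned toy quadratic as before (the quadratic and hyperparameters are illustrative assumptions): the velocity damps the oscillation along the steep direction while building speed along the shallow one.

import numpy as np

A = np.array([1.0, 10.0])          # same ill-conditioned quadratic as above
grad_f = lambda w: A * w

w = np.array([10.0, 1.0])
v = np.zeros_like(w)               # velocity starts at rest
rho, learning_rate = 0.9, 0.02
for t in range(100):
	dw = grad_f(w)
	v = rho * v + dw               # accumulate gradients into the velocity
	w -= learning_rate * v         # step along the velocity
print(w)                           # close to the minimum at the origin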