Gradient Descent

```python
# Vanilla gradient descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights -= step_size * weights_grad  # perform parameter update
```

The step size (also called the learning rate) is a hyperparameter.
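
For intuition, here is a toy run (an assumed example, not from the notes) minimizing f(w) = w², whose gradient is 2w; the choice of step size decides between convergence and divergence:

```python
# Toy example (illustrative assumption): gradient descent on f(w) = w^2.
# The gradient of w^2 is 2w.
w = 5.0
step_size = 0.1  # with step_size > 1.0 the iterates diverge here
for _ in range(100):
    w -= step_size * 2 * w
print(w)  # approximately 0, the minimum of f
```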

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i\big(f(x_i, W),\, y_i\big) + \lambda R(W)$$

$$\nabla_W L = \frac{1}{N}\sum_{i=1}^{N} \nabla_W L_i\big(f(x_i, W),\, y_i\big) + \lambda \nabla_W R(W)$$
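
As a sketch of what evaluating this full-batch loss and gradient can look like, assuming a linear softmax classifier and L2 regularization R(W) = Σ W² (the function and all names below are illustrative, not from the notes):

```python
import numpy as np

def full_loss_and_grad(W, X, y, lam):
    """Illustrative sketch: L2-regularized softmax loss for a linear
    classifier f(x_i, W) = x_i W, averaged over all N examples."""
    N = X.shape[0]
    scores = X @ W                                      # (N, C) class scores
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)           # softmax probabilities
    data_loss = -np.log(probs[np.arange(N), y]).mean()  # (1/N) sum_i L_i
    loss = data_loss + lam * np.sum(W * W)              # + lambda R(W)
    dscores = probs
    dscores[np.arange(N), y] -= 1                       # dL_i / dscores
    dW = X.T @ dscores / N + 2 * lam * W                # grad_W L
    return loss, dW

# Example usage with random data (shapes: X is (N, D), W is (D, C)).
W = 0.01 * np.random.randn(10, 3)
X = np.random.randn(5, 10)
y = np.array([0, 2, 1, 0, 2])
loss, dW = full_loss_and_grad(W, X, y, lam=1e-3)
```

Note that the gradient has exactly the structure of the equation above: an average of per-example gradients plus the gradient of the regularizer.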

The full sum is expensive when N is large!

Approximate the sum using a minibatch of examples (sizes of 32/64/128 are common): Stochastic Gradient Descent (SGD).
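
A minimal minibatch loop might look as follows; it reuses the assumed pseudocode names (evaluate_gradient, loss_fun, data, weights, step_size) from the vanilla loop above and additionally assumes data is a numpy array:

```python
import numpy as np

batch_size = 64  # 32/64/128 are common
while True:
    # Estimate the gradient on a small random subset of the data.
    idx = np.random.choice(len(data), batch_size, replace=False)
    data_batch = data[idx]
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights -= step_size * weights_grad  # same update rule as before
```

With uniform sampling, the minibatch gradient is an unbiased estimate of the full gradient, so each step is far cheaper while still pointing downhill on average.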