Activation Function

2-layer neural network: f = W2 max(0, W1 x)

The elementwise max(0, ·) is the activation function
The function ReLU(z) = max(0, z) is called the “Rectified Linear Unit”
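
A minimal sketch of the 2-layer network formula in plain Python (no libraries, to keep it dependency-free). The weights W1, W2 and the input x are hypothetical values picked only for illustration; matvec is a helper introduced here, not something from the lecture.

```python
def relu(z):
    """Elementwise ReLU: max(0, v) for each entry v of the vector z."""
    return [max(0.0, v) for v in z]

def matvec(W, x):
    """Multiply a matrix W (given as a list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def two_layer_net(W1, W2, x):
    """f = W2 max(0, W1 x)."""
    return matvec(W2, relu(matvec(W1, x)))

# Hypothetical weights, chosen only for illustration.
W1 = [[1.0, -1.0],
      [-1.0, 1.0]]
W2 = [[1.0, 1.0]]

x = [3.0, 1.0]
f = two_layer_net(W1, W2, x)                # W1 x = [2, -2] -> ReLU -> [2, 0] -> f = [2]
f_no_activation = matvec(W2, matvec(W1, x))  # same weights without the max: [0]
```

With these particular weights the network computes |x1 - x2|, while dropping the max leaves a purely linear map that outputs 0 for every input.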

Examples of Activation Functions

Pasted image 20241201115747.png

ReLU is a good default choice

Space warping

Consider a linear transform: h = Wx where x, h are both 2-dimensional

Consider a neural net hidden layer:
h = ReLU(Wx) = max(0, Wx) where x, h are both 2-dimensional

Pasted image 20241201120736.png
Pasted image 20241201120755.png
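
A small sketch of the warping, assuming W is the 2x2 identity so the effect of ReLU is easy to see on its own. The four sample points (one per quadrant) are made up for illustration.

```python
def linear(W, x):
    """h = Wx for a 2x2 matrix W and a 2-dimensional x."""
    return [W[0][0] * x[0] + W[0][1] * x[1],
            W[1][0] * x[0] + W[1][1] * x[1]]

def hidden(W, x):
    """h = ReLU(Wx) = max(0, Wx), applied elementwise."""
    return [max(0.0, h) for h in linear(W, x)]

W = [[1.0, 0.0],
     [0.0, 1.0]]  # identity, an assumption made for clarity

# One hypothetical point per quadrant.
points = [[1.0, 2.0], [-1.0, 2.0], [-1.0, -2.0], [1.0, -2.0]]

linear_out = [linear(W, p) for p in points]  # points stay where they are
relu_out = [hidden(W, p) for p in points]    # negative coordinates are clamped to 0
```

The linear map keeps all four points distinct. After ReLU, the first-quadrant point is unchanged, the second- and fourth-quadrant points land on the axes, and the third-quadrant point collapses to the origin.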


Pasted image 20241202142410.png

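3 problems with sigmoid

  1. Saturated neurons “kill” the gradients: in the flat areas of the curve the local gradient is close to 0
  2. Sigmoid outputs are not zero-centered: the inputs to the next layer are all positive, so the gradients on its weights are always all positive or all negative
  3. exp() is a bit expensive to compute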
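
A rough numerical check of the saturation and zero-centering problems, assuming the standard sigmoid σ(x) = 1 / (1 + exp(-x)). The upstream gradient and the pre-activation values are hypothetical numbers used only to show the sign pattern.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Local gradient: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Saturation: the gradient peaks at 0.25 for x = 0 and shrinks toward 0 in the flat areas.
grad_center = sigmoid_grad(0.0)
grad_saturated = [sigmoid_grad(x) for x in (-10.0, 10.0)]

# Not zero-centered: sigmoid outputs are all positive, so for a downstream
# neuron w . a the gradient on each w_i is upstream * a_i, and every
# component has the same sign as the upstream gradient.
activations = [sigmoid(v) for v in (-2.0, 0.5, 3.0)]  # hypothetical pre-activations
upstream = -0.7                                        # hypothetical dL/d(w . a)
grad_w = [upstream * a for a in activations]
```

Here every entry of grad_w is negative; flipping the sign of upstream would make every entry positive.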


Tanh (a scaled and shifted sigmoid)

Pasted image 20241202143625.png
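
Assuming the shifted sigmoid here is tanh, the relation can be written as tanh(x) = 2σ(2x) - 1. The sketch below checks that identity numerically on a few arbitrary sample points; shifted_sigmoid is a name introduced only for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shifted_sigmoid(x):
    """Scaled and shifted sigmoid: 2 * sigmoid(2x) - 1, with outputs in (-1, 1)."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

xs = [-3.0, -0.5, 0.0, 0.5, 3.0]  # arbitrary sample points
max_diff = max(abs(shifted_sigmoid(x) - math.tanh(x)) for x in xs)
```

The output is zero-centered, but the curve still has flat areas at both ends, so saturated neurons still kill the gradient.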

ReLU (Rectified Linear Unit)

Pasted image 20241202143737.png


  • Not zero-centered output

What happens when x < 0?

Gradient will be identically 0

  • Learning cannot proceed

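Dead ReLU

  • A ReLU whose input is negative for every training example will never activate, so it will never update
  • Sometimes ReLU neurons are initialized with a slightly positive bias (e.g. 0.01) to reduce this risk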
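
A sketch of why a dead ReLU never updates, using a single neuron max(0, w·x + b). The data, weights, biases and the upstream gradient of 1.0 are hypothetical, and weight_grads is a helper made up for this illustration.

```python
def relu_grad(z):
    """Local gradient of max(0, z): 1 for z > 0, otherwise 0."""
    return 1.0 if z > 0 else 0.0

def weight_grads(w, b, data, upstream=1.0):
    """Gradient w.r.t. w for each input in data, given an upstream gradient."""
    grads = []
    for x in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b  # pre-activation
        grads.append([upstream * relu_grad(z) * xi for xi in x])
    return grads

data = [[0.5, 1.0], [1.5, 0.2], [0.1, 0.9]]  # hypothetical inputs
w = [1.0, 1.0]

dead = weight_grads(w, -10.0, data)   # pre-activation < 0 on every input
alive = weight_grads(w, 0.01, data)   # slightly positive bias keeps it active here
```

For the neuron with bias -10 every gradient is exactly 0, so no gradient step can move its weights. With the small positive bias the gradients equal the inputs and learning can proceed.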

Leaky ReLU

Pasted image 20241202144437.png
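
A minimal sketch of Leaky ReLU, assuming the common form f(x) = max(αx, x) with a small fixed slope α = 0.01 for negative inputs.

```python
def leaky_relu(x, alpha=0.01):
    """x for x > 0, alpha * x otherwise (alpha = 0.01 is an assumed default)."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Local gradient: 1 for x > 0, alpha otherwise, so it is never exactly 0."""
    return 1.0 if x > 0 else alpha

values = [leaky_relu(x) for x in (-2.0, 0.0, 3.0)]
grads = [leaky_relu_grad(x) for x in (-2.0, 3.0)]
```

Because the gradient for x < 0 is α rather than 0, the neuron keeps receiving a small update signal and cannot die in the way a plain ReLU can.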

Parametric ReLU (PReLU)
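
PReLU has the same shape as Leaky ReLU, f(x) = max(αx, x), except that the negative slope α is a parameter learned by backpropagation. The sketch below is an assumption-laden illustration: the starting α, the input, the upstream gradient and the learning rate are all hypothetical.

```python
def prelu(x, alpha):
    """x for x > 0, alpha * x otherwise; alpha is learned, not fixed."""
    return x if x > 0 else alpha * x

def prelu_grads(x, alpha):
    """Return (d out / d x, d out / d alpha)."""
    if x > 0:
        return 1.0, 0.0
    return alpha, x

# One hypothetical gradient-descent step on alpha.
alpha = 0.25
x, upstream, lr = -2.0, 1.0, 0.1
_, d_alpha = prelu_grads(x, alpha)
new_alpha = alpha - lr * upstream * d_alpha  # 0.25 - 0.1 * (-2.0)
```

Only negative inputs contribute a gradient to α, since for x > 0 the output does not depend on it.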


Exponential Linear Unit (ELU)

Pasted image 20241202145325.png

Computation requires exp()
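
A minimal sketch of ELU, assuming the usual definition f(x) = x for x > 0 and α(exp(x) - 1) otherwise, with α = 1.0 as an assumed default.

```python
import math

def elu(x, alpha=1.0):
    """x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    """Local gradient: 1 for x > 0, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)

values = [elu(x) for x in (-50.0, -1.0, 0.0, 2.0)]
```

For very negative inputs the output levels off near -α instead of growing without bound, and the exp() call is the extra cost compared with ReLU.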

Scaled Exponential Linear Unit (SELU)

Pasted image 20241202145336.png
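
SELU is an ELU multiplied by a fixed scale λ, with α and λ set to specific constants. A sketch assuming the commonly published values α ≈ 1.6733 and λ ≈ 1.0507; the constant names below are introduced for this example.

```python
import math

# Commonly published SELU constants (assumed here, written to double precision).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise."""
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * (math.exp(x) - 1.0)

values = [selu(x) for x in (-50.0, 0.0, 1.0)]
```

For very negative inputs the output approaches -λα, roughly -1.758.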


Pasted image 20241202145850.png

Just use ReLU

  • Try out Leaky ReLU / ELU / SELU / GELU if you need to squeeze that last 0.1%
  • Don’t use sigmoid or tanh