Activation Function

2-layer Neural Network: f = W2 max(0, W1 x)

The max is the activation function.
The function ReLU(z) = max(0, z) is called the “Rectified Linear Unit”
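
A minimal sketch of the 2-layer network f = W2 max(0, W1 x) in NumPy; the layer sizes (4 → 8 → 3) and random weights are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(4)        # input vector
W1 = rng.standard_normal((8, 4))   # first-layer weights
W2 = rng.standard_normal((3, 8))   # second-layer weights

h = np.maximum(0, W1 @ x)          # ReLU activation: elementwise max(0, z)
f = W2 @ h                         # second (linear) layer
print(f.shape)                     # (3,)
```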

Examples of Activation Functions

Pasted image 20241201115747.png

ReLU is a good default choice

Space warping

Consider a linear transform: h = Wx where x, h are both 2-dimensional

Consider a neural net hidden layer:
h = ReLU(Wx) = max(0, Wx) where x, h are both 2-dimensional

Pasted image 20241201120736.png
Pasted image 20241201120755.png
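
A small sketch of the space-warping idea: a linear map h = Wx can only rotate, scale, and shear the plane, while h = ReLU(Wx) additionally collapses entire half-planes onto the axes of h-space. The matrix W and the sample points below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[1.0,  1.0],
              [1.0, -1.0]])
pts = rng.standard_normal((5, 2))       # a few 2-D points

linear = pts @ W.T                      # h = Wx: invertible, nothing collapses
warped = np.maximum(0, pts @ W.T)       # h = ReLU(Wx)

# Any point whose Wx has a negative coordinate gets clamped to 0 in that
# coordinate, i.e. it is projected onto an axis of h-space.
print(linear)
print(warped)
```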

Sigmoid

Pasted image 20241202142410.png

σ(x) = 1 / (1 + e^(−x))
3 Problems

  1. Saturated neurons “kill” the gradients: the flat regions of the curve have near-zero local gradient (see the sketch below)
  2. Sigmoid outputs are not zero-centered: the inputs to the next layer are always positive, so the gradients on its weights are always all positive or all negative
  3. exp() is a bit compute expensive
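
A short sketch of the saturation problem: the sigmoid’s local gradient is σ′(x) = σ(x)(1 − σ(x)), which is nearly zero far from the origin, so almost no gradient flows back through a saturated neuron. The sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    s = sigmoid(x)
    grad = s * (1 - s)              # local gradient of the sigmoid
    print(f"x={x:+5.1f}  sigma={s:.5f}  dsigma/dx={grad:.5f}")
# At x = ±10 the local gradient is ~4.5e-05: the flat regions block learning.
```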

Tanh

A scaled and shifted sigmoid: tanh(x) = 2σ(2x) − 1. Zero-centered, but it still kills gradients when saturated.

Pasted image 20241202143625.png
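
A quick numeric check of the “scaled and shifted sigmoid” relation tanh(x) = 2σ(2x) − 1, using an arbitrary grid of points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True
```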

ReLU (Rectified Linear Unit)

Pasted image 20241202143737.png

Problems

  • Not zero-centered output
  • What happens when x < 0? The gradient is identically 0, so learning cannot proceed for those inputs

Dead ReLU

  • A neuron stuck in the x < 0 regime for every input will never activate or update (see the sketch below)
  • Sometimes ReLU neurons are initialized with a slightly positive bias to avoid this
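
A sketch of the dead ReLU issue: if w·x + b < 0 for every input, the ReLU outputs 0 and its local gradient is identically 0, so w and b never receive an update from those examples. The weights, bias, and data below are fabricated to force the neuron into the dead regime.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(float)    # 1 where z > 0, else 0

w, b = np.array([-2.0, -3.0]), -1.0                            # neuron pushed into the dead regime
X = np.abs(np.random.default_rng(0).standard_normal((5, 2)))   # non-negative inputs

z = X @ w + b
print(relu(z))        # all zeros: the neuron never activates on this data
print(relu_grad(z))   # all zeros: no gradient ever reaches w or b
```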

Leaky ReLU

Pasted image 20241202144437.png

Parametric ReLU (PReLU)

f(x) = max(αx, x), where the slope α is a learnable parameter
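
A minimal sketch of both variants: Leaky ReLU uses a small fixed negative slope (0.01 is the common default), while PReLU uses the same form but treats α as a parameter learned during training (here it is simply passed in as a number).

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)      # f(x) = max(alpha*x, x), fixed slope

def prelu(x, alpha):
    return np.maximum(alpha * x, x)      # same form, but alpha is learned

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))           # [-0.03  -0.005  0.     2.   ]
print(prelu(x, alpha=0.25))    # [-0.75  -0.125  0.     2.   ]
```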

Exponential Linear Unit (ELU)

Pasted image 20241202145325.png

Computation requires exp()
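
A sketch of ELU under the standard definition (identity for x > 0, α(e^x − 1) for x ≤ 0, with α = 1 by default); the exp() call is what makes it a bit more expensive than ReLU:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(x))   # roughly [-0.9502 -0.6321  0.      2.    ]
```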

Scaled Exponential Linear Unit (SELU)

Pasted image 20241202145336.png
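
A sketch of SELU as a scaled ELU, f(x) = λ · ELU_α(x), using the fixed constants α ≈ 1.6733 and λ ≈ 1.0507 chosen so that activations self-normalize toward zero mean and unit variance across layers:

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 2.0])
print(selu(x))   # roughly [-1.5202  0.      2.1014]
```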

Comparison

Pasted image 20241202145850.png

Just use ReLU

  • Try out Leaky ReLU / ELU / SELU / GELU if you need to squeeze that last 0.1%
  • Don’t use sigmoid or tanh
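
A small PyTorch sketch of this advice: start with nn.ReLU and, if chasing a small accuracy gain, swap in nn.LeakyReLU / nn.ELU / nn.SELU / nn.GELU without touching the rest of the model. The make_mlp helper and its layer sizes are hypothetical, for illustration only.

```python
import torch.nn as nn

def make_mlp(activation=nn.ReLU()):
    # Hypothetical 2-layer MLP; only the activation module is swapped out.
    return nn.Sequential(
        nn.Linear(128, 64),
        activation,
        nn.Linear(64, 10),
    )

model = make_mlp()                  # default: ReLU
model_gelu = make_mlp(nn.GELU())    # drop-in replacement to try
```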