Activation Function

2-layer Neural Network: f = W2 max(0, W1 x)

The max is the activation function.
The function ReLU(z) = max(0, z) is called the “Rectified Linear Unit”
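
A minimal sketch of the 2-layer network f = W2 max(0, W1 x) in NumPy; the layer sizes (4 → 8 → 3) and random weights are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(4)        # input vector
W1 = rng.standard_normal((8, 4))   # first-layer weights
W2 = rng.standard_normal((3, 8))   # second-layer weights

h = np.maximum(0, W1 @ x)          # ReLU activation: elementwise max(0, z)
f = W2 @ h                         # second (linear) layer
print(f.shape)                     # (3,)
```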

Examples of Activation Functions

Pasted image 20241201115747.png

ReLU is a good default choice

Space warping

Consider a linear transform: h = Wx where x, h are both 2-dimensional

Consider a neural net hidden layer:
h = ReLU(Wx) = max(0, Wx) where x, h are both 2-dimensional

Pasted image 20241201120736.png
Pasted image 20241201120755.png
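
A small sketch of the space-warping idea: a linear map h = Wx can only rotate, scale, and shear the plane, while h = ReLU(Wx) additionally collapses entire half-planes onto the axes of h-space. The matrix W and the sample points below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[1.0,  1.0],
              [1.0, -1.0]])
pts = rng.standard_normal((5, 2))       # a few 2-D points

linear = pts @ W.T                      # h = Wx: invertible, nothing collapses
warped = np.maximum(0, pts @ W.T)       # h = ReLU(Wx)

# Any point whose Wx has a negative coordinate gets clamped to 0 in that
# coordinate, i.e. it is projected onto an axis of h-space.
print(linear)
print(warped)
```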

Sigmoid

Pasted image 20241202142410.png

σ(x) = 1 / (1 + e^(−x))
3 Problems

  1. Saturated neurons “kill” the gradients: the flat regions of the curve have near-zero local gradient (see the sketch below)
  2. Sigmoid outputs are not zero-centered: the inputs to the next layer are always positive, so the gradients on its weights are always all positive or all negative
  3. exp() is a bit compute expensive
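
A short sketch of the saturation problem: the sigmoid’s local gradient is σ′(x) = σ(x)(1 − σ(x)), which is nearly zero far from the origin, so almost no gradient flows back through a saturated neuron. The sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    s = sigmoid(x)
    grad = s * (1 - s)              # local gradient of the sigmoid
    print(f"x={x:+5.1f}  sigma={s:.5f}  dsigma/dx={grad:.5f}")
# At x = ±10 the local gradient is ~4.5e-05: the flat regions block learning.
```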

Tanh

A scaled and shifted sigmoid: tanh(x) = 2σ(2x) − 1. Zero-centered, but it still kills gradients when saturated.

Pasted image 20241202143625.png
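
A quick numeric check of the “scaled and shifted sigmoid” relation tanh(x) = 2σ(2x) − 1, using an arbitrary grid of points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True
```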

ReLU (Rectified Linear Unit)

Pasted image 20241202143737.png

Problems

  • Not zero-centered output
  • What happens when x < 0? The gradient is identically 0, so learning cannot proceed for those inputs

Dead ReLU

  • A neuron stuck in the x < 0 regime for every input will never activate or update (see the sketch below)
  • Sometimes ReLU neurons are initialized with a slightly positive bias to avoid this
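
A sketch of the dead ReLU issue: if w·x + b < 0 for every input, the ReLU outputs 0 and its local gradient is identically 0, so w and b never receive an update from those examples. The weights, bias, and data below are fabricated to force the neuron into the dead regime.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(float)    # 1 where z > 0, else 0

w, b = np.array([-2.0, -3.0]), -1.0                            # neuron pushed into the dead regime
X = np.abs(np.random.default_rng(0).standard_normal((5, 2)))   # non-negative inputs

z = X @ w + b
print(relu(z))        # all zeros: the neuron never activates on this data
print(relu_grad(z))   # all zeros: no gradient ever reaches w or b
```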

Leaky ReLU

Pasted image 20241202144437.png

Parametric ReLU (PReLU)

f(x) = max(αx, x), where the slope α is a learnable parameter
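
A minimal sketch of both variants: Leaky ReLU uses a small fixed negative slope (0.01 is the common default), while PReLU uses the same form but treats α as a parameter learned during training (here it is simply passed in as a number).

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)      # f(x) = max(alpha*x, x), fixed slope

def prelu(x, alpha):
    return np.maximum(alpha * x, x)      # same form, but alpha is learned

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))           # [-0.03  -0.005  0.     2.   ]
print(prelu(x, alpha=0.25))    # [-0.75  -0.125  0.     2.   ]
```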

Exponential Linear Unit (ELU)

Pasted image 20241202145325.png

Computation requires exp()
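
A sketch of ELU under the standard definition (identity for x > 0, α(e^x − 1) for x ≤ 0, with α = 1 by default); the exp() call is what makes it a bit more expensive than ReLU:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(x))   # roughly [-0.9502 -0.6321  0.      2.    ]
```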

Scaled Exponential Linear Unit (SELU)

Pasted image 20241202145336.png
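
A sketch of SELU as a scaled ELU, f(x) = λ · ELU_α(x), using the fixed constants α ≈ 1.6733 and λ ≈ 1.0507 chosen so that activations self-normalize toward zero mean and unit variance across layers:

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 2.0])
print(selu(x))   # roughly [-1.5202  0.      2.1014]
```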

Comparison

Pasted image 20241202145850.png

Just use ReLU

  • Try out Leaky ReLU / ELU / SELU / GELU if you need to squeeze that last 0.1%
  • Don’t use sigmoid or tanh
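
A small PyTorch sketch of this advice: start with nn.ReLU and, if chasing a small accuracy gain, swap in nn.LeakyReLU / nn.ELU / nn.SELU / nn.GELU without touching the rest of the model. The make_mlp helper and its layer sizes are hypothetical, for illustration only.

```python
import torch.nn as nn

def make_mlp(activation=nn.ReLU()):
    # Hypothetical 2-layer MLP; only the activation module is swapped out.
    return nn.Sequential(
        nn.Linear(128, 64),
        activation,
        nn.Linear(64, 10),
    )

model = make_mlp()                  # default: ReLU
model_gelu = make_mlp(nn.GELU())    # drop-in replacement to try
```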