Recurrent Neural Networks
So far: “Feedforward” Neural Networks
What if we want other types of networks?
What about one to many?
What about many to one?
What about many to many?
- Machine translation: sequence of words → sequence of words
Key Idea
RNNs have an “internal state” that is updated as a sequence is processed
We can process a sequence of vectors x by applying a recurrence formula at every time step: h_t = f_W(h_{t-1}, x_t), where the new state depends on the previous state and the current input
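As a minimal sketch (the names `f_W`, `W_hh`, `W_xh` are the usual lecture notation, not a library API), the recurrence h_t = f_W(h_{t-1}, x_t) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3                         # hidden size, input size
W_hh = rng.normal(0, 0.1, (H, H))   # hidden-to-hidden weights
W_xh = rng.normal(0, 0.1, (H, D))   # input-to-hidden weights

def f_W(h_prev, x):
    """One recurrence step: new state from old state and current input."""
    return np.tanh(W_hh @ h_prev + W_xh @ x)

h = np.zeros(H)                     # initial hidden state (all zeros)
xs = rng.normal(size=(5, D))        # a sequence of 5 input vectors
for x in xs:
    h = f_W(h, x)                   # same f_W, same weights, at every step
```

Note that the same function and the same weight matrices are applied at every time step; only the state `h` changes.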
Vanilla Recurrent Neural Networks
RNN Computational Graph
Initial hidden state
- Either set to all 0
- Or learn it
Re-use the same weight matrix at every time-step
What if different timesteps had different weights?
- Could only handle a fixed input length, which depends on the number of weight matrices
- Model size would increase linearly with the number of timesteps
- Different weights applied at different timesteps make the weights more difficult to learn
Many to Many
Many to One
One to Many
Sequence to Sequence (Machine translation)
Many to One + One to Many
- Encoder + Decoder
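A sketch of the encoder + decoder structure (all weight names and shapes are illustrative; a real model would be trained, and would use a proper output projection rather than the crude feedback below):

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 4, 3
W_enc = rng.normal(0, 0.1, (H, H + D))   # encoder weights (hypothetical)
W_dec = rng.normal(0, 0.1, (H, H + D))   # decoder weights (hypothetical)

def step(W, h, x):
    return np.tanh(W @ np.concatenate([h, x]))

# Encoder: many to one -- compress the input sequence into one state.
src = rng.normal(size=(6, D))            # "source sentence" of 6 vectors
h = np.zeros(H)
for x in src:
    h = step(W_enc, h, x)
context = h                              # summary of the whole input

# Decoder: one to many -- unroll outputs starting from the context vector.
h, x = context, np.zeros(D)
outputs = []
for _ in range(4):                       # generate 4 output vectors
    h = step(W_dec, h, x)
    outputs.append(h)
    x = h[:D]                            # crude stand-in for feeding the
                                         # previous output back as input
```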
Example: Language Modeling
Given characters 1, 2, …, t, the model predicts character t + 1
- Predicts at every time-step
At test-time, generate new text:
- sample characters one at a time
- feed back to model
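Test-time sampling can be sketched like this (the tiny vocabulary and randomly initialized weights are illustrative; a trained model would produce sensible text):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = list("hello ")
V, H = len(vocab), 8
W_xh = rng.normal(0, 0.1, (H, V))
W_hh = rng.normal(0, 0.1, (H, H))
W_hy = rng.normal(0, 0.1, (V, H))

def sample(seed_char, n):
    h = np.zeros(H)
    x = np.eye(V)[vocab.index(seed_char)]      # one-hot encode the seed
    out = [seed_char]
    for _ in range(n):
        h = np.tanh(W_xh @ x + W_hh @ h)       # recurrence step
        scores = W_hy @ h
        p = np.exp(scores) / np.exp(scores).sum()  # softmax over characters
        idx = rng.choice(V, p=p)               # sample one character
        out.append(vocab[idx])
        x = np.eye(V)[idx]                     # feed it back as the next input
    return "".join(out)

text = sample("h", 10)
```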
So far: encode inputs as one-hot vectors
A matrix multiply with a one-hot vector just extracts a column of the weight matrix
- Often extract this into a separate embedding layer
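Why the embedding view works: multiplying a weight matrix by a one-hot vector selects one column, so the matrix can be treated as a lookup table:

```python
import numpy as np

W = np.arange(12).reshape(3, 4).astype(float)  # pretend embedding matrix
x = np.zeros(4)
x[2] = 1.0                                     # one-hot vector for index 2

via_matmul = W @ x       # full matrix multiply
via_lookup = W[:, 2]     # just read column 2 -- the "embedding layer" view

assert np.array_equal(via_matmul, via_lookup)
```

The lookup is cheaper than the matmul, which is why the first layer is often split out as a separate embedding layer.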
Backpropagation Through Time
Takes a lot of memory for long sequences!
Truncated Backpropagation Through Time
TL;DR
Run forward and backward through chunks of the sequence instead of whole sequence
Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
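A sketch of the chunking pattern (the backward pass is elided here; it would unroll only within one chunk, while the hidden state is carried across chunk boundaries as a constant):

```python
import numpy as np

rng = np.random.default_rng(3)
H, D, chunk = 4, 3, 10
W = rng.normal(0, 0.1, (H, H + D))
seq = rng.normal(size=(35, D))            # a long sequence: 35 steps

h = np.zeros(H)
n_chunks = 0
for start in range(0, len(seq), chunk):
    for x in seq[start:start + chunk]:    # forward through one chunk
        h = np.tanh(W @ np.concatenate([h, x]))
    # The backward pass would run here, unrolling only these `chunk` steps;
    # `h` is carried into the next chunk but no gradient flows through it.
    n_chunks += 1
```

Memory now scales with the chunk length rather than the full sequence length.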
RNN Tradeoffs
Advantages
- Same weights applied on every timestep
- Can process any length input
- Computation for step t can use information from many steps back
- Model size doesn’t increase for longer input
Disadvantages
- Recurrent computation is slow
- In practice, difficult to access information from many steps back
Example: Image Captioning
- Take the feature vector coming out of a CNN
- Feed it into the RNN
Result:
We have a START token to mark the beginning of the prediction and an END token so the model knows when to stop
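The START/END control flow can be sketched as follows (vocabulary, weights, and the way the CNN feature initializes the hidden state are all illustrative; with untrained weights the END token may or may not fire before the length cap):

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["<START>", "<END>", "a", "cat", "sits"]
V, H, F = len(vocab), 8, 6
W_ih = rng.normal(0, 0.1, (H, F))    # projects the CNN feature vector
W_xh = rng.normal(0, 0.1, (H, V))
W_hh = rng.normal(0, 0.1, (H, H))
W_hy = rng.normal(0, 0.1, (V, H))

feature = rng.normal(size=F)         # stand-in for a CNN's feature vector
h = np.tanh(W_ih @ feature)          # feature initializes the hidden state
x = np.eye(V)[vocab.index("<START>")]

caption = []
for _ in range(10):                  # cap the caption length
    h = np.tanh(W_xh @ x + W_hh @ h)
    idx = int(np.argmax(W_hy @ h))   # greedy decoding
    word = vocab[idx]
    if word == "<END>":              # stop as soon as the model emits END
        break
    caption.append(word)
    x = np.eye(V)[idx]               # feed the chosen word back in
```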