Recurrent Neural Networks
So far: “Feedforward” Neural Networks
What if we want other types of networks?
What about one to many?
What about many to one?
What about many to many?
- Machine translation: sequence of words → sequence of words
Key Idea
RNNs have an “internal state” that is updated as a sequence is processed
We can process a sequence of vectors x by applying a recurrence formula at every time step: h_t = f_W(h_{t-1}, x_t), where the new state depends on the previous state and the current input
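As a minimal sketch (the names `f_W`, `W_hh`, `W_xh` are the usual lecture notation, not a library API), the recurrence h_t = f_W(h_{t-1}, x_t) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3                         # hidden size, input size
W_hh = rng.normal(0, 0.1, (H, H))   # hidden-to-hidden weights
W_xh = rng.normal(0, 0.1, (H, D))   # input-to-hidden weights

def f_W(h_prev, x):
    """One recurrence step: new state from old state and current input."""
    return np.tanh(W_hh @ h_prev + W_xh @ x)

h = np.zeros(H)                     # initial hidden state (all zeros)
xs = rng.normal(size=(5, D))        # a sequence of 5 input vectors
for x in xs:
    h = f_W(h, x)                   # same f_W, same weights, at every step
```

Note that the same function and the same weight matrices are applied at every time step; only the state `h` changes.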
Vanilla Recurrent Neural Networks
RNN Computational Graph
Initial hidden state
- Either set to all 0
- Or learn it
Re-use the same weight matrix at every time-step
What if different timesteps had different weights?
- Could only handle a fixed input length, which depends on the number of weight matrices
- Model size would increase linearly with the number of timesteps
- Different weights applied at different timesteps make the weights more difficult to learn
Many to Many
Many to One
One to Many
Sequence to Sequence (Machine translation)
Many to One + One to Many
- Encoder + Decoder
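A sketch of the encoder + decoder structure (all weight names and shapes are illustrative; a real model would be trained, and would use a proper output projection rather than the crude feedback below):

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 4, 3
W_enc = rng.normal(0, 0.1, (H, H + D))   # encoder weights (hypothetical)
W_dec = rng.normal(0, 0.1, (H, H + D))   # decoder weights (hypothetical)

def step(W, h, x):
    return np.tanh(W @ np.concatenate([h, x]))

# Encoder: many to one -- compress the input sequence into one state.
src = rng.normal(size=(6, D))            # "source sentence" of 6 vectors
h = np.zeros(H)
for x in src:
    h = step(W_enc, h, x)
context = h                              # summary of the whole input

# Decoder: one to many -- unroll outputs starting from the context vector.
h, x = context, np.zeros(D)
outputs = []
for _ in range(4):                       # generate 4 output vectors
    h = step(W_dec, h, x)
    outputs.append(h)
    x = h[:D]                            # crude stand-in for feeding the
                                         # previous output back as input
```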
Example: Language Modeling
Given characters 1, 2, …, t, the model predicts character t + 1
- Predicts at every time-step
At test-time, generate new text:
- sample characters one at a time
- feed back to model
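Test-time sampling can be sketched like this (the tiny vocabulary and randomly initialized weights are illustrative; a trained model would produce sensible text):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = list("hello ")
V, H = len(vocab), 8
W_xh = rng.normal(0, 0.1, (H, V))
W_hh = rng.normal(0, 0.1, (H, H))
W_hy = rng.normal(0, 0.1, (V, H))

def sample(seed_char, n):
    h = np.zeros(H)
    x = np.eye(V)[vocab.index(seed_char)]      # one-hot encode the seed
    out = [seed_char]
    for _ in range(n):
        h = np.tanh(W_xh @ x + W_hh @ h)       # recurrence step
        scores = W_hy @ h
        p = np.exp(scores) / np.exp(scores).sum()  # softmax over characters
        idx = rng.choice(V, p=p)               # sample one character
        out.append(vocab[idx])
        x = np.eye(V)[idx]                     # feed it back as the next input
    return "".join(out)

text = sample("h", 10)
```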
So far: encode inputs as one-hot vectors
A matrix multiply with a one-hot vector just extracts a column of the weight matrix
- Often extract this into a separate embedding layer
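Why the embedding view works: multiplying a weight matrix by a one-hot vector selects one column, so the matrix can be treated as a lookup table:

```python
import numpy as np

W = np.arange(12).reshape(3, 4).astype(float)  # pretend embedding matrix
x = np.zeros(4)
x[2] = 1.0                                     # one-hot vector for index 2

via_matmul = W @ x       # full matrix multiply
via_lookup = W[:, 2]     # just read column 2 -- the "embedding layer" view

assert np.array_equal(via_matmul, via_lookup)
```

The lookup is cheaper than the matmul, which is why the first layer is often split out as a separate embedding layer.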
Backpropagation Through Time
Takes a lot of memory for long sequences!
Truncated Backpropagation Through Time
TL;DR
Run forward and backward through chunks of the sequence instead of whole sequence
Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
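A sketch of the chunking pattern (the backward pass is elided here; it would unroll only within one chunk, while the hidden state is carried across chunk boundaries as a constant):

```python
import numpy as np

rng = np.random.default_rng(3)
H, D, chunk = 4, 3, 10
W = rng.normal(0, 0.1, (H, H + D))
seq = rng.normal(size=(35, D))            # a long sequence: 35 steps

h = np.zeros(H)
n_chunks = 0
for start in range(0, len(seq), chunk):
    for x in seq[start:start + chunk]:    # forward through one chunk
        h = np.tanh(W @ np.concatenate([h, x]))
    # The backward pass would run here, unrolling only these `chunk` steps;
    # `h` is carried into the next chunk but no gradient flows through it.
    n_chunks += 1
```

Memory now scales with the chunk length rather than the full sequence length.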
RNN Tradeoffs
Advantages
- Same weights applied on every timestep
- Can process any length input
- Computation for step t can use information from many steps back
- Model size doesn’t increase for longer input
Disadvantages
- Recurrent computation is slow
- In practice, difficult to access information from many steps back
Example: Image Captioning
- Take the feature vector coming out of a CNN
- Feed it into the RNN
Result:
We have a START token to mark the beginning of the prediction and an END token so the model knows when to stop
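The START/END control flow can be sketched as follows (vocabulary, weights, and the way the CNN feature initializes the hidden state are all illustrative; with untrained weights the END token may or may not fire before the length cap):

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["<START>", "<END>", "a", "cat", "sits"]
V, H, F = len(vocab), 8, 6
W_ih = rng.normal(0, 0.1, (H, F))    # projects the CNN feature vector
W_xh = rng.normal(0, 0.1, (H, V))
W_hh = rng.normal(0, 0.1, (H, H))
W_hy = rng.normal(0, 0.1, (V, H))

feature = rng.normal(size=F)         # stand-in for a CNN's feature vector
h = np.tanh(W_ih @ feature)          # feature initializes the hidden state
x = np.eye(V)[vocab.index("<START>")]

caption = []
for _ in range(10):                  # cap the caption length
    h = np.tanh(W_xh @ x + W_hh @ h)
    idx = int(np.argmax(W_hy @ h))   # greedy decoding
    word = vocab[idx]
    if word == "<END>":              # stop as soon as the model emits END
        break
    caption.append(word)
    x = np.eye(V)[idx]               # feed the chosen word back in
```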