Attention
Problem: Sequence to Sequence (Translation)
Input sequence bottlenecked through fixed-sized vector. What if T = 1000?
Idea:
- Use new context vector at each step of decoder
- Use attention
- Compute (scalar) alignment scores: e_{t,i} = f_att(s_{t-1}, h_i) (f_att is an MLP)
- Normalize alignment scores to get attention weights: a_t = softmax(e_t), so 0 < a_{t,i} < 1 and Σ_i a_{t,i} = 1
- Compute context vector as linear combination of hidden states: c_t = Σ_i a_{t,i} h_i
- Use context vector in decoder: s_t = g_U(y_{t-1}, s_{t-1}, c_t)
At each later timestep (e.g. t = 2) the same steps are repeated with the new decoder state, giving new alignment scores, attention weights, and a new context vector.
Use a different context vector at each timestep of the decoder:
- Input sequence is not bottlenecked through a single vector
- At each timestep of the decoder, the context vector “looks at” different parts of the input sequence (a code sketch of one decoder timestep follows below)
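Below is a minimal sketch of one decoder timestep with this attention mechanism in PyTorch. The hidden size H, the MLP used for f_att, the GRU decoder cell, and the decoder_step helper are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H = 128  # hidden size (assumed for illustration)

# f_att: small MLP scoring the previous decoder state against one encoder state
f_att = nn.Sequential(nn.Linear(2 * H, H), nn.Tanh(), nn.Linear(H, 1))
# Decoder cell takes [y_{t-1}; c_t] as input (GRU chosen only for brevity)
decoder_cell = nn.GRUCell(input_size=2 * H, hidden_size=H)

def decoder_step(s_prev, y_prev, enc_hidden):
    """One decoder timestep with attention.
    s_prev:     (H,)   previous decoder state s_{t-1}
    y_prev:     (H,)   embedding of previous output y_{t-1}
    enc_hidden: (T, H) encoder hidden states h_1..h_T
    """
    T = enc_hidden.shape[0]
    # 1. Alignment scores e_{t,i} = f_att(s_{t-1}, h_i), shape (T,)
    e = f_att(torch.cat([s_prev.expand(T, H), enc_hidden], dim=1)).squeeze(1)
    # 2. Attention weights: softmax over the input sequence, shape (T,)
    a = F.softmax(e, dim=0)
    # 3. Context vector: weighted sum of encoder hidden states, shape (H,)
    c = (a.unsqueeze(1) * enc_hidden).sum(dim=0)
    # 4. New decoder state s_t = g_U(y_{t-1}, s_{t-1}, c_t)
    s_t = decoder_cell(torch.cat([y_prev, c]).unsqueeze(0), s_prev.unsqueeze(0))
    return s_t.squeeze(0), a
```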
Image Captioning with RNNs and Attention
Problem
Input: Image
Output: Sequence
Steps
- Extract spatial features from a pretrained CNN: a grid of vectors h_{i,j}
- Compute alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
- Normalize alignment scores to get attention weights: a_{t,:,:} = softmax(e_{t,:,:}), normalized over all grid positions
- Compute context vector as linear combination of hidden states: c_t = Σ_{i,j} a_{t,i,j} h_{i,j}
- Use context vector in decoder: s_t = g_U(y_{t-1}, s_{t-1}, c_t)
At each timestep (the figure shows t = 1, 2, and 4) the attention weights highlight a different region of the image; a minimal code sketch of one timestep follows.
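A minimal sketch of the attention computation for one captioning timestep, assuming a 7×7 grid of 512-dimensional CNN features and using a dot product as the scoring function for brevity (the steps above use an MLP f_att):

```python
import torch
import torch.nn.functional as F

D = 512                                # feature dimension (assumed)
feats = torch.randn(7, 7, D)           # spatial features h_{i,j} from a pretrained CNN
s_prev = torch.randn(D)                # previous decoder state s_{t-1}

# Alignment scores over the grid (dot product here; the notes use an MLP f_att)
e = (feats * s_prev).sum(dim=-1)       # (7, 7)

# Attention weights: softmax over all grid positions
a = F.softmax(e.flatten(), dim=0).view(7, 7)

# Context vector: weighted sum of the spatial features
c = (a.unsqueeze(-1) * feats).sum(dim=(0, 1))   # (D,)
# c is then fed to the decoder along with y_{t-1} and s_{t-1}
```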
General Attention Layer
Inputs
Query vector: q (Shape: D_Q)
Input vectors: X (Shape: N_X × D_Q)
Similarity function: f_att (an MLP)
Computation
Alignment: e_i = f_att(q, X_i) (Shape: N_X)
Attention: a = softmax(e) (Shape: N_X)
Output: c = Σ_i a_i X_i (Shape: D_Q), sketched below
Outputs
Context vector: c (Shape: D_Q)
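A minimal sketch of this basic layer, assuming D_Q = 64, N_X = 10, and a small MLP for f_att (all illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_Q, N_X = 64, 10                     # assumed sizes for illustration
q = torch.randn(D_Q)                  # query vector
X = torch.randn(N_X, D_Q)             # input vectors

# f_att: similarity MLP (assumed structure, matching the seq2seq formulation)
f_att = nn.Sequential(nn.Linear(2 * D_Q, D_Q), nn.Tanh(), nn.Linear(D_Q, 1))

e = f_att(torch.cat([q.expand(N_X, D_Q), X], dim=1)).squeeze(1)  # alignment, shape (N_X,)
a = F.softmax(e, dim=0)                                          # attention weights, shape (N_X,)
c = a @ X                                                        # context vector, shape (D_Q,)
```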
Changes for generalization
- Use scaled dot product for alignment: e_i = (q · X_i) / sqrt(D_Q)
  - Large similarities will cause the softmax to saturate and give vanishing gradients
  - Dividing by sqrt(D_Q) reduces the effect of large-magnitude vectors
- Multiple query vectors: Q (Shape: N_Q × D_Q); E = Q X^T / sqrt(D_Q), A = softmax(E, dim=1), Y = A X
- Separate key and value transforms of the input: K = X W_K, V = X W_V; then E = Q K^T / sqrt(D_Q), A = softmax(E, dim=1), Y = A V (a combined sketch follows below)
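Putting the generalizations together, a sketch of the resulting attention layer with scaled dot-product alignment, multiple queries, and separate key/value projections; the class name, the dimensions, and the bias-free linear projections are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    """General attention layer sketch: scaled dot product, multiple queries,
    and separate key/value projections of the input vectors."""
    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.key = nn.Linear(d_x, d_q, bias=False)     # K = X W_K
        self.value = nn.Linear(d_x, d_v, bias=False)   # V = X W_V
        self.scale = math.sqrt(d_q)                    # divide by sqrt(D_Q)

    def forward(self, Q, X):
        # Q: (N_Q, D_Q) query vectors; X: (N_X, D_X) input vectors
        K = self.key(X)                  # keys,    (N_X, D_Q)
        V = self.value(X)                # values,  (N_X, D_V)
        E = Q @ K.T / self.scale         # alignment scores, (N_Q, N_X)
        A = F.softmax(E, dim=1)          # one attention distribution per query
        return A @ V                     # outputs, (N_Q, D_V)

# Usage sketch
layer = AttentionLayer(d_x=512, d_q=64, d_v=64)
Q = torch.randn(5, 64)       # 5 query vectors
X = torch.randn(10, 512)     # 10 input vectors
Y = layer(Q, X)              # (5, 64)
```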