Attention

Problem: Sequence to Sequence (Translation)

Pasted image 20241202225036.png

Input sequence bottlenecked through fixed-sized vector. What if T = 1000?

Idea:

  • Use new context vector at each step of decoder
  • Use attention
    1. Compute (scalar) alignment scores
      • $e_{t,i} = f_{att}(s_{t-1}, h_i)$ ($f_{att}$ is an MLP)
    2. Normalize alignment scores to get attention weights
      • $a_{t,i} = \text{softmax}_i(e_{t,i})$
    3. Compute context vector as linear combination of hidden states
      • $c_t = \sum_i a_{t,i} h_i$
    4. Use context vector in decoder (see the sketch after this list)
      • $s_t = g_U(y_{t-1}, s_{t-1}, c_t)$
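
A minimal PyTorch sketch of one decoder step following the four steps above. The module names (`att_mlp`, `decoder_cell`), the hidden size, and the choice of a GRU cell for $g_U$ are illustrative assumptions, not from the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H = 128                                   # hidden size (assumed)
att_mlp = nn.Sequential(                  # f_att: MLP scoring (s_{t-1}, h_i) pairs
    nn.Linear(2 * H, H), nn.Tanh(), nn.Linear(H, 1))
decoder_cell = nn.GRUCell(2 * H, H)       # g_U: consumes [embed(y_{t-1}); c_t]

def decoder_step(s_prev, y_prev_emb, enc_h):
    """s_prev: (B, H) previous decoder state s_{t-1}
       y_prev_emb: (B, H) embedding of previous output token y_{t-1}
       enc_h: (B, T, H) encoder hidden states h_1..h_T"""
    B, T, _ = enc_h.shape
    # 1. alignment scores e_{t,i} = f_att(s_{t-1}, h_i)
    s_rep = s_prev.unsqueeze(1).expand(B, T, H)
    e = att_mlp(torch.cat([s_rep, enc_h], dim=-1)).squeeze(-1)   # (B, T)
    # 2. attention weights: softmax over the input positions i
    a = F.softmax(e, dim=-1)                                     # (B, T)
    # 3. context vector c_t = sum_i a_{t,i} h_i
    c = torch.bmm(a.unsqueeze(1), enc_h).squeeze(1)              # (B, H)
    # 4. new decoder state s_t = g_U(y_{t-1}, s_{t-1}, c_t)
    s_t = decoder_cell(torch.cat([y_prev_emb, c], dim=-1), s_prev)
    return s_t, a
```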

Pasted image 20241202225741.png

Timestep = 2

Pasted image 20241202225925.png

Use a different context vector in each timestep of decoder

  • Input sequence not bottlenecked through single vector
  • At each timestep of decoder, context vector “looks at” different parts of the input sequence

Image Captioning with RNNs and Attention

Problem

Input: Image
Output: Sequence $y = y_1, y_2, \ldots, y_T$ (caption)

Steps

  1. Extract spatial features from a pretrained CNN (see the sketch after this list)
  2. Compute alignment scores
  3. Normalize alignment scores to get attention weights
    1. Softmax
  4. Compute context vector as linear combination of hidden states
  5. Use context vector in decoder
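
A sketch of step 1, assuming a torchvision ResNet-50 backbone (the source does not specify which CNN); the global pooling and classifier head are dropped so a grid of per-location feature vectors is kept.

```python
import torch
import torchvision

# Pretrained CNN with its pooling/FC head removed (torchvision >= 0.13 API).
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
feature_extractor.eval()

with torch.no_grad():
    images = torch.randn(2, 3, 224, 224)           # dummy image batch
    fmap = feature_extractor(images)               # (B, 2048, 7, 7) spatial feature map
    feat_grid = fmap.flatten(2).transpose(1, 2)    # (B, 49, 2048): one vector per location
```

Steps 2–5 then reuse the same attention computation as the decoder sketch above, with `feat_grid` playing the role of the encoder hidden states $h_i$: scores over the 7×7 grid, softmax, weighted sum into $c_t$, and one decoder step on $(y_{t-1}, s_{t-1}, c_t)$.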

Timestep = 1

Pasted image 20241202231047.png

Timestep = 2

Pasted image 20241202231117.png

Timestep = 4

Pasted image 20241202231146.png

General Attention Layer

Inputs

Query vector: $q$ (Shape: $D_Q$)
Input vectors: $X$ (Shape: $N_X \times D_X$)
Similarity function: $f_{att}$

Computation

Alignment: $e_i = f_{att}(q, x_i)$ (Shape: $N_X$)
Attention: $a = \text{softmax}(e)$ (Shape: $N_X$)
Output: $c = \sum_i a_i x_i$ (Shape: $D_X$)

Outputs

Context vector: $c$ (Shape: $D_X$)
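
A minimal sketch of this layer, assuming dot-product similarity for $f_{att}$ (which requires $D_Q = D_X$); the MLP scorer from the seq2seq case would work just as well here.

```python
import torch
import torch.nn.functional as F

def attention(q, X):
    """q: (D,) query vector; X: (N, D) input vectors.
       Returns context c: (D,) and attention weights a: (N,)."""
    e = X @ q                    # alignment: e_i = f_att(q, x_i)   -> (N,)
    a = F.softmax(e, dim=0)      # attention weights                -> (N,)
    c = a @ X                    # context: c = sum_i a_i * x_i     -> (D,)
    return c, a

# Usage: 10 input vectors of dimension 64
q = torch.randn(64)
X = torch.randn(10, 64)
c, a = attention(q, X)           # c: (64,), a: (10,), a sums to 1
```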

Changes for generalization

Self-Attention Layer