Self-Attention Layer

One query per input vector: queries, keys, and values are all computed from the inputs themselves
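A minimal sketch of a single-head self-attention layer (PyTorch assumed; the class name `SelfAttention` and the dimensions are illustrative, not from the lecture): each input vector is projected to its own query, key, and value, and attention weights come from a softmax over scaled query-key dot products.

```python
# Minimal single-head self-attention sketch (illustrative dimensions).
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim_in, dim_q, dim_v):
        super().__init__()
        self.query = nn.Linear(dim_in, dim_q)   # one query per input vector
        self.key = nn.Linear(dim_in, dim_q)
        self.value = nn.Linear(dim_in, dim_v)
        self.scale = 1.0 / math.sqrt(dim_q)

    def forward(self, x):             # x: (N, T, dim_in)
        q = self.query(x)             # (N, T, dim_q)
        k = self.key(x)               # (N, T, dim_q)
        v = self.value(x)             # (N, T, dim_v)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (N, T, T)
        return attn @ v               # (N, T, dim_v)
```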

Permutation Equivariant

Permuting the input vectors permutes the outputs in the same way; self-attention by itself doesn't know the order of its inputs
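A quick numerical check of this property, reusing the `SelfAttention` sketch above (assumption: the layer is already defined and imported as written there):

```python
# Permutation-equivariance check: permuting inputs permutes outputs identically.
layer = SelfAttention(dim_in=8, dim_q=8, dim_v=8).eval()
x = torch.randn(1, 5, 8)
perm = torch.randperm(5)
with torch.no_grad():
    out = layer(x)
    out_perm = layer(x[:, perm])
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True
```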

Positional Encoding

Add positional information to the inputs to make processing position-aware, since self-attention alone ignores order
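A sketch of one common choice, the fixed sinusoidal encoding from "Attention Is All You Need" (a learned embedding table is an equally common alternative); the function name and sizes here are illustrative:

```python
# Sinusoidal positional encoding sketch: add position-dependent vectors to the inputs.
import math
import torch

def sinusoidal_positional_encoding(seq_len, dim):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe                            # (T, dim)

x = torch.randn(1, 5, 8)
x = x + sinusoidal_positional_encoding(5, 8)   # now processing is position-aware
```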

Masked Self-Attention Layer

Don’t let vectors “look ahead” in the sequence
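The usual mechanism is to set the attention scores for future positions to -inf before the softmax, so each vector attends only to itself and earlier positions. A small sketch of just the masking step (single head, illustrative sizes):

```python
# Causal mask sketch: block attention to future positions before the softmax.
import torch

T = 5
scores = torch.randn(1, T, T)                          # raw query-key scores
mask = torch.triu(torch.ones(T, T), diagonal=1).bool() # True above the diagonal (future)
scores = scores.masked_fill(mask, float('-inf'))       # future positions get -inf
attn = torch.softmax(scores, dim=-1)                   # each row sums to 1 over past + self
```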

Multihead Self-Attention Layer

Use H independent “Attention Heads” in parallel

Hyperparameters

Query dimension D_Q and number of heads H
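One way to run H heads in parallel is to split the embedding dimension into H slices, run attention on each, and concatenate the results; PyTorch's built-in `nn.MultiheadAttention` works roughly this way, so a sketch can just use it (sizes here are illustrative):

```python
# Multihead self-attention sketch: H heads in parallel, outputs concatenated.
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 8           # hyperparameters: query dimension and number of heads H
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)      # (N, T, embed_dim)
out, attn_weights = mha(x, x, x)       # query = key = value = x  -> self-attention
print(out.shape)                       # torch.Size([2, 10, 64])
```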

Example: CNN with Self-Attention

(figure: CNN with self-attention)
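A sketch of the idea in that example, under my own naming and dimension assumptions (`SpatialSelfAttention`, channel counts): 1x1 convolutions produce queries, keys, and values from a CNN feature map, attention is computed over all H*W spatial positions, and the result is projected by another 1x1 convolution and added back to the input as a residual.

```python
# CNN + self-attention sketch: attend over all H*W spatial positions of a feature map.
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.q = nn.Conv2d(channels, inner, kernel_size=1)    # 1x1 convs make Q, K, V
        self.k = nn.Conv2d(channels, inner, kernel_size=1)
        self.v = nn.Conv2d(channels, inner, kernel_size=1)
        self.proj = nn.Conv2d(inner, channels, kernel_size=1) # 1x1 conv back to C channels
        self.scale = inner ** -0.5

    def forward(self, x):                          # x: (N, C, H, W)
        N, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (N, H*W, inner)
        k = self.k(x).flatten(2)                   # (N, inner, H*W)
        v = self.v(x).flatten(2).transpose(1, 2)   # (N, H*W, inner)
        attn = torch.softmax(q @ k * self.scale, dim=-1)       # (N, H*W, H*W)
        out = (attn @ v).transpose(1, 2).reshape(N, -1, H, W)  # (N, inner, H, W)
        return x + self.proj(out)                  # residual connection around the block

features = torch.randn(2, 64, 16, 16)              # CNN feature map
print(SpatialSelfAttention(64)(features).shape)    # torch.Size([2, 64, 16, 16])
```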