Self-Supervised Learning
- Solve “pretext” tasks that produce good features for downstream tasks
- Learn with supervised learning objectives
- Labels of these pretext tasks are generated automatically
What is a good feature representation that we want to learn?
- Transferable to different downstream tasks
- Generalizable to different image domains
- We are interested in learning “semantic features”
What are good “pretext tasks” for learning these features?
- The pretext tasks do not require extra annotations
Self-supervised pretext tasks
Example: Learn to predict image transformations / complete corrupted images
- Solving the pretext tasks allows the model to learn good features
- We can automatically generate labels for the pretext tasks
How to evaluate a self-supervised learning method?
We usually don’t care about the performance on the self-supervised pretext tasks themselves
- We don’t care if the model learns to predict image rotation perfectly
Evaluate the learned feature encoders on downstream target tasks
- Learn good feature extractors from self-supervised pretext tasks
- Attach a shallow network on top of the feature extractor
- Train the shallow network on the target task with a small amount of labeled data (see the sketch below)
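A minimal sketch of this evaluation protocol, assuming a ResNet-50 backbone as a stand-in for the pre-trained encoder (the 2048-dim feature size and `num_classes` are illustrative):

```python
import torch.nn as nn
from torchvision.models import resnet50

# Stand-in for a feature extractor pre-trained on a pretext task (illustrative)
encoder = nn.Sequential(*list(resnet50().children())[:-1], nn.Flatten())
for p in encoder.parameters():
    p.requires_grad = False                    # freeze the pre-trained features

num_classes = 10                               # illustrative downstream task
classifier = nn.Linear(2048, num_classes)      # shallow network on top
# Train only `classifier` on the small labeled target dataset:
#   logits = classifier(encoder(x)); loss = F.cross_entropy(logits, y)
```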
Pretext Task
Predict Rotations
Hypothesis
A model could recognize the correct rotation of an object only if it has the “visual common-sense” of what the object should look like unperturbed
- Self-supervised learning by rotating the entire input images
- The model learns to predict which rotation is applied
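A minimal sketch of the rotation pretext task, assuming image tensors of shape (B, C, H, W); `make_rotation_batch` is an illustrative helper, and the labels come for free from the applied rotation:

```python
import torch

def make_rotation_batch(images):
    """images: (B, C, H, W). Rotate by 0/90/180/270 degrees; the label is the rotation index."""
    rotated, labels = [], []
    for k in range(4):                                        # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Usage: a 4-way classification head on top of the feature extractor
#   inputs, labels = make_rotation_batch(images)
#   loss = F.cross_entropy(head(encoder(inputs)), labels)
```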
Predict Relative Patch Locations
Solving Jigsaw Puzzles
Predict Missing Pixels (Inpainting)
Learning to inpaint by reconstruction
Learning Features from Colorization: Split-brain Autoencoder
Idea: Cross-channel predictions
Image Coloring
Concatenate (L,ab) channels
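A minimal sketch of the cross-channel setup, assuming RGB inputs converted to the Lab color space with scikit-image; `net_L2ab` and `net_ab2L` are hypothetical sub-networks:

```python
from skimage.color import rgb2lab

def split_lab(rgb_image):
    """rgb_image: (H, W, 3) float array in [0, 1] -> (L, ab) channel groups."""
    lab = rgb2lab(rgb_image)
    return lab[..., :1], lab[..., 1:]          # L: (H, W, 1), ab: (H, W, 2)

# Each half of the network predicts the channels it does not see:
#   pred_ab = net_L2ab(L)      # grayscale -> color (image colorization)
#   pred_L  = net_ab2L(ab)     # color -> grayscale
# The features of the two halves are concatenated to form the full representation.
```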
Video Coloring
Idea: Model the temporal coherence of colors in videos
Hypothesis: Learning to color video frames should allow model to learn to track regions or objects without labels
Learning Objective:
Establish mappings between reference and target frames in a learned feature space
Use the mapping as “pointers” to copy the correct color
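A minimal sketch of the pointer mechanism, assuming per-pixel features for a reference and a target frame (all names illustrative): attention weights over the reference frame are used to copy its colors to the target frame.

```python
import torch
import torch.nn.functional as F

def copy_colors(ref_feat, tgt_feat, ref_colors, temperature=1.0):
    """ref_feat, tgt_feat: (N, D) per-pixel features; ref_colors: (N, C)."""
    attn = F.softmax(tgt_feat @ ref_feat.t() / temperature, dim=-1)   # (N, N) "pointers"
    return attn @ ref_colors                   # predicted colors for the target frame

# Training minimizes the error between predicted and true target-frame colors;
# at test time the same attention can propagate segmentation masks or keypoints.
```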
Tracking emerges from colorization
Propagate segmentation masks using learned attention
Propagate pose keypoints using learned attention
Problem: Learned representations may not be general
Contrastive Representation Learning
We want: score(f(x), f(x⁺)) to be much higher than score(f(x), f(x⁻)) for a positive pair (x, x⁺) and negative samples x⁻
Given a chosen score function, we aim to learn an encoder function f that achieves this
A Formulation of contrastive learning
Loss function given 1 positive sample and N-1 negative samples:
Cross-entropy loss for an N-way softmax classifier!
- Learn to find the positive sample from the N samples
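A minimal sketch of this loss, assuming one positive key and N-1 negative keys per query (the temperature value is illustrative); it is exactly cross-entropy over N candidates with the positive at index 0:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query, keys, temperature=0.1):
    """query: (D,); keys: (N, D) with keys[0] the positive and the rest negatives."""
    query = F.normalize(query, dim=0)
    keys = F.normalize(keys, dim=1)
    logits = keys @ query / temperature            # (N,) similarity scores
    target = torch.zeros(1, dtype=torch.long)      # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```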
SimCLR: A Simple Framework for Contrastive Learning
Cosine similarity as the score function:
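For reference, the cosine similarity between two projected representations z_i and z_j (a standard definition):

s(z_i, z_j) = \frac{z_i^\top z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}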
Use a projection network
Generate positive sample through data augmentation:
- Random cropping, random color distortion, and random blur
- e.g., two augmented views of the same image form a positive pair (see the sketch below)
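A sketch of such an augmentation pipeline with torchvision (the parameter values are illustrative, not the paper's exact settings):

```python
from torchvision import transforms

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# Two independent augmentations of the same image form a positive pair:
#   x_i, x_j = simclr_augment(img), simclr_augment(img)
```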
Mini-batch Training
Design Choices
Projection head
Linear / non-linear projection heads improve representation learning
- By leveraging the projection head, more information can be preserved in the representation space
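A minimal sketch of a non-linear projection head g(·) (dimensions are illustrative); the contrastive loss is computed on z = g(h), while downstream tasks use h and discard g after pre-training:

```python
import torch.nn as nn

# g(.): maps encoder features h (e.g. 2048-d) into the space where the
# contrastive loss is applied (e.g. 128-d)
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),
    nn.ReLU(),
    nn.Linear(2048, 128),
)
```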
Large batch size
Momentum Contrastive Learning (MoCo)
Motivation
- Self-supervised learning can be thought of as building dynamic dictionaries. An encoded “query” should be similar to its matching key and dissimilar to others
- Dictionaries should be large (large number of negative samples better sample the underlying continuous, high dimensional visual space) and consistent (keys in the dictionary should be represented by the same or similar encoder so that their comparisons to the query are consistent)
Key differences to SimCLR
- Keep a running queue of keys
- Compute gradients and update the encoder only through the queries
- Decouple the mini-batch size from the number of keys
- Can support a large number of negative samples
- The key encoder progresses slowly via the momentum update rule (see the sketch below)
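A minimal sketch of the momentum update and the key queue (variable names and the momentum coefficient are illustrative); gradients flow only through the query encoder:

```python
import torch

momentum = 0.999   # illustrative momentum coefficient

@torch.no_grad()
def momentum_update(query_encoder, key_encoder):
    """Key-encoder parameters slowly track the query encoder."""
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.data.mul_(momentum).add_(q.data, alpha=1.0 - momentum)

@torch.no_grad()
def update_queue(queue, new_keys):
    """Enqueue the newest keys and dequeue the oldest. queue: (K, D), new_keys: (B, D)."""
    return torch.cat([new_keys, queue], dim=0)[: queue.size(0)]
```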
MoCo V2
A hybrid of ideas from SimCLR and MoCo:
From SimCLR: Non-linear projection head and strong data augmentation
From MoCo: momentum-updated key encoder and a queue of keys, allowing training with a large number of negative samples
Masked Autoencoder
Idea:
- Mask some patches
- Predict masked patches
Randomly mask a high proportion (75%) of patches:
- Random masking avoids center bias
- Creates a challenging pretext task for learning image semantics
- Efficient computation
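A minimal sketch of random masking over patch embeddings (names are illustrative); only the kept patches are passed to the encoder, which is what makes the computation efficient:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D). Keep a random (1 - mask_ratio) subset of patches per image."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)       # random permutation of patch indices
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle                     # the encoder sees only `visible`
```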
Encoding Visible Patches:
- Encoder: Vision Transformers (ViT)
- Linear projection for mapping pixel values to patch embeddings
- Positional embeddings are added to the patch embeddings
Decoding Masked Patches
- Decoder: ViT
- Decoder makes predictions based on:
- Visible patches
- Masked tokens
- Position embeddings are added to all patches
- Each predicted patch: 16 x 16 pixels = a 256-dim vector
- Trained with an MSE loss
The decoder is discarded after pre-training, and all patches (no masking) are encoded during downstream fine-tuning
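A minimal sketch of the reconstruction objective, assuming the decoder outputs flattened pixel values per patch and a binary mask marks which patches were hidden (names and shapes are illustrative); the MSE is averaged over masked patches only:

```python
import torch

def mae_loss(pred, target, mask):
    """pred, target: (B, N, P) per-patch pixel values; mask: (B, N), 1.0 = masked."""
    per_patch_mse = ((pred - target) ** 2).mean(dim=-1)   # MSE for each patch
    return (per_patch_mse * mask).sum() / mask.sum()      # average over masked patches only

# Illustrative shapes: 196 patches, 256-dim pixel targets (16 x 16 values per patch)
pred = torch.randn(8, 196, 256)
target = torch.randn(8, 196, 256)
mask = (torch.rand(8, 196) < 0.75).float()
loss = mae_loss(pred, target, mask)
```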