Self-Supervised Learning

Self-supervised learning aims to learn from data without manual label annotations

  • Solve “pretext” tasks that produce good features for downstream tasks
    • Learn with supervised learning objectives
    • Labels of these pretext tasks are generated automatically

What is a good feature representation that we want to learn?

What should be good “pretext tasks” for learning the good features?

Self-supervised pretext tasks

Example: Learn to predict image transformations / complete corrupted images
Pasted image 20241206213533.png

  1. Solving the pretext tasks allows the model to learn good features
  2. We can automatically generate labels for the pretext tasks

How to evaluate a self-supervised learning method?

We usually don’t care about the performance on the self-supervised pretext tasks themselves; what matters is how useful the learned features are for downstream tasks

  1. Learn good feature extractors from self-supervised pretext tasks
  2. Attach a shallow network on top of the feature extractor
    1. Train the shallow network on the target task with a small amount of labeled data (see the sketch below)
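
A minimal PyTorch sketch of this evaluation protocol (linear probing), assuming a pretrained `encoder` and a small labeled `train_loader` for the target task; all names here are placeholders, not a fixed API:

```python
# Linear-probe evaluation: freeze the self-supervised encoder and train only
# a shallow (linear) head on a small labeled dataset for the target task.
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, feat_dim, num_classes, epochs=10, device="cpu"):
    encoder.eval()                        # freeze the pretext-trained features
    for p in encoder.parameters():
        p.requires_grad = False

    head = nn.Linear(feat_dim, num_classes).to(device)   # shallow network on top
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images)           # features from the frozen encoder
            loss = loss_fn(head(feats), labels)   # only the head is updated
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```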

Pretext Task

Predict Rotations

Hypothesis
A model could recognize the correct rotation of an object only if it has the “visual common-sense” of what the object should look like unperturbed

Pasted image 20241206214109.png
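
A minimal sketch of how this pretext task can be set up, assuming a placeholder `model` that outputs 4 logits (one per rotation); the rotation labels are generated automatically, with no human annotation:

```python
# Rotation-prediction pretext task: rotate each image by 0/90/180/270 degrees
# and train a 4-way classifier to recognize which rotation was applied.
import torch
import torch.nn.functional as F

def rotation_pretext_loss(model, images):
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))  # rotate by k * 90 degrees
        labels.append(torch.full((images.size(0),), k,
                                 dtype=torch.long, device=images.device))
    x = torch.cat(rotated)      # (4B, C, H, W) rotated copies
    y = torch.cat(labels)       # (4B,) automatically generated labels
    logits = model(x)           # 4-way rotation classifier
    return F.cross_entropy(logits, y)
```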

Predict Relative Patch Locations

Pasted image 20241206214242.png

Solving Jigsaw Puzzles

Pasted image 20241206214259.png

Predict Missing Pixels (Inpainting)

Pasted image 20241206214612.png

Learning to inpaint by reconstruction

Pasted image 20241206214646.png
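
A minimal sketch of the reconstruction objective, assuming an encoder-decoder `model`; here a central square is masked and the loss is applied only on the missing region (adversarial terms sometimes used in practice are omitted):

```python
# Inpainting by reconstruction: remove a region of the image, predict the
# full image, and penalize the reconstruction only inside the masked region.
import torch
import torch.nn.functional as F

def inpainting_loss(model, images, mask_size=16):
    B, C, H, W = images.shape
    mask = torch.zeros(B, 1, H, W, device=images.device)
    top, left = (H - mask_size) // 2, (W - mask_size) // 2
    mask[:, :, top:top + mask_size, left:left + mask_size] = 1.0   # 1 = missing pixels

    corrupted = images * (1 - mask)      # corrupted input with the region removed
    predicted = model(corrupted)         # model predicts the full image
    # Reconstruction loss on the missing pixels only.
    return F.mse_loss(predicted * mask, images * mask)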

Learning Features from Colorization: Split-brain Autoencoder

Idea: Cross-channel predictions

Pasted image 20241206214932.png
Pasted image 20241206215019.png

Image Coloring

Pasted image 20241206215125.png
Concatenate (L,ab) channels
Pasted image 20241206215209.png
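
A rough sketch of the cross-channel prediction idea on Lab images, assuming two placeholder sub-networks `net_L_to_ab` and `net_ab_to_L`; the actual split-brain autoencoder quantizes the color space and uses a classification loss, whereas this sketch uses simple regression:

```python
# Split-brain cross-channel prediction: one branch predicts the ab channels
# from L, the other predicts L from ab, and the two predicted halves are
# concatenated to form the full (L, ab) output.
import torch
import torch.nn.functional as F

def split_brain_loss(net_L_to_ab, net_ab_to_L, lab_images):
    L, ab = lab_images[:, :1], lab_images[:, 1:]    # split the Lab channels
    pred_ab = net_L_to_ab(L)                        # colorization branch
    pred_L = net_ab_to_L(ab)                        # grayscale-prediction branch
    recon = torch.cat([pred_L, pred_ab], dim=1)     # concatenate (L, ab) predictions
    return F.mse_loss(recon, lab_images)
```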

Video Coloring

Idea: Model the temporal coherence of colors in videos

Pasted image 20241206215255.png
Hypothesis: Learning to color video frames should allow the model to learn to track regions or objects without labels

Learning Objective:

Establish mappings between reference and target frames in a learned feature space
Use the mapping as “pointers” to copy the correct color
Pasted image 20241206215428.png
Pasted image 20241206215500.png
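
A rough sketch of this objective, assuming flattened feature maps `f_ref`, `f_tgt` of shape (B, N, D) and per-location colors `c_ref`, `c_tgt`; the attention weights over the reference frame act as the "pointers" that copy colors:

```python
# Video colorization objective: target-frame features attend over
# reference-frame features, and the attention weights copy reference
# colors onto the target frame.
import torch
import torch.nn.functional as F

def video_coloring_loss(f_ref, f_tgt, c_ref, c_tgt):
    # Similarity between every target location and every reference location.
    sim = torch.bmm(f_tgt, f_ref.transpose(1, 2))    # (B, N_tgt, N_ref)
    attn = F.softmax(sim, dim=-1)                    # pointers into the reference frame
    pred_colors = torch.bmm(attn, c_ref)             # copy colors via the pointers
    return F.mse_loss(pred_colors, c_tgt)
```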

Tracking emerges from colorization

Propagate segmentation masks using learned attention

Pasted image 20241206215617.png

Propagate pose keypoints using learned attention

Pasted image 20241206215641.png

Problem: Learned representations may not be general

Pasted image 20241206215748.png

Contrastive Representation Learning

Pasted image 20241206215948.png
We want:

$$\text{score}\big(f(x), f(x^+)\big) \gg \text{score}\big(f(x), f(x^-)\big)$$

Given a chosen score function, we aim to learn an encoder function $f$ that yields a high score for positive pairs $(x, x^+)$ and low scores for negative pairs $(x, x^-)$

A Formulation of contrastive learning

Loss function given 1 positive sample and N-1 negative samples:
Pasted image 20241206220437.png
This is the cross-entropy loss for an N-way softmax classifier!
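
Reconstructed here in its standard InfoNCE form, with score function $s$ and encoder $f$ (notation assumed to match the slide):

$$
\mathcal{L} = -\,\mathbb{E}\left[\log \frac{\exp\!\big(s(f(x), f(x^+))\big)}{\exp\!\big(s(f(x), f(x^+))\big) + \sum_{j=1}^{N-1} \exp\!\big(s(f(x), f(x_j^-))\big)}\right]
$$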

SimCLR: A Simple Framework for Contrastive Learning

Cosine similarity as the score function:
Pasted image 20241206220626.png
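
For reference, the cosine similarity between feature vectors $u$ and $v$ (in SimCLR it is additionally divided by a temperature $\tau$):

$$
s(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
$$
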
Use a projection network g() to project features to a space where contrastive learning is applied
Pasted image 20241206220739.png
Generate positive samples through data augmentation (e.g., random cropping, color distortion, Gaussian blur)

Mini-batch Training

Pasted image 20241206220923.png
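
A minimal PyTorch sketch of this mini-batch training scheme, with placeholder encoder `f`, projection head `g`, and `augment` function; each image contributes two augmented views, and the loss is a cross-entropy over temperature-scaled cosine similarities within the batch:

```python
# SimCLR-style mini-batch loss: two views per image, a projection head on top
# of the encoder, and cross-entropy over within-batch cosine similarities.
import torch
import torch.nn.functional as F

def simclr_loss(f, g, augment, images, tau=0.5):
    x1, x2 = augment(images), augment(images)            # two random views per image
    z = F.normalize(g(f(torch.cat([x1, x2]))), dim=1)    # (2B, D) unit-norm projections

    n = z.size(0)
    sim = z @ z.t() / tau                                # cosine similarities / temperature
    # Exclude each sample's similarity with itself.
    sim = sim.masked_fill(torch.eye(n, dtype=torch.bool, device=z.device), float("-inf"))

    half = n // 2
    # The positive for sample i is its other augmented view.
    targets = torch.cat([torch.arange(half, n), torch.arange(0, half)]).to(z.device)
    return F.cross_entropy(sim, targets)
```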

Design Choices

Projection head

Both linear and non-linear projection heads improve representation learning compared to using no projection head, with the non-linear head performing best
Pasted image 20241206221040.png

Large batch size

Larger training batches provide more negative samples per iteration and improve the learned representations, at the cost of substantial compute
Pasted image 20241206221150.png

Momentum Contrastive Learning (MoCo)

Motivation

In SimCLR, the number of negative samples is tied to the mini-batch size, so good performance requires very large batches. MoCo decouples the two by maintaining a queue of previously encoded keys that serve as negatives.

Key differences to SimCLR

  • Negative keys come from a running queue rather than from the current batch, so the number of negatives is not limited by the batch size
  • The key encoder is not updated by backpropagation; it is a momentum (moving-average) copy of the query encoder, which keeps the keys in the queue consistent

MoCo V2

A hybrid of ideas from SimCLR and MoCo:
From SimCLR: Non-linear projection head and strong data augmentation
From MoCo: A momentum-updated key encoder and a queue of keys, which allow training against a large number of negative samples
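
A minimal sketch of the two MoCo ingredients mentioned above, assuming query/key encoders `encoder_q` and `encoder_k` with identical architectures; the momentum value and queue size are illustrative defaults:

```python
# MoCo ingredients: (1) momentum (EMA) update of the key encoder,
# (2) a fixed-size queue of keys used as negative samples.
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # The key encoder slowly follows the query encoder.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def update_queue(queue, new_keys, max_size=65536):
    # Enqueue the newest keys and drop the oldest so the queue stays fixed-size.
    queue = torch.cat([queue, new_keys], dim=0)
    return queue[-max_size:]
```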

Masked Autoencoder

Idea:

  1. Mask some patches
  2. Predict masked patches
    Pasted image 20241206222524.png

Randomly mask a large fraction of patches (e.g., 75%)

Encoding Visible Patches
The encoder operates only on the visible (unmasked) patches, which keeps pretraining efficient.

Decoding Masked Patches
A lightweight decoder takes the encoded visible patches together with learnable mask tokens and reconstructs the pixels of the masked patches; the loss is computed only on the masked patches.
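
A rough PyTorch sketch of the whole pipeline, assuming placeholder `encoder`, `decoder`, and a learnable `mask_token`, with patches already flattened to shape (B, N, D); positional embeddings and the encoder-to-decoder projection are omitted for brevity:

```python
# Masked autoencoder sketch: randomly mask ~75% of patches, encode only the
# visible ones, decode with mask tokens, and reconstruct the masked pixels.
import torch
import torch.nn.functional as F

def mae_loss(encoder, decoder, mask_token, patches, mask_ratio=0.75):
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Random per-sample shuffle: the first `num_keep` indices stay visible.
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :num_keep], ids_shuffle[:, num_keep:]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                       # encoder sees visible patches only

    # Append mask tokens for the missing positions and decode.
    masks = mask_token.expand(B, N - num_keep, -1)  # learnable (1, 1, D) token, broadcast
    pred = decoder(torch.cat([latent, masks], dim=1))   # (B, N, D) per-patch reconstruction

    # Loss is computed on the masked patches only.
    target = torch.gather(patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
    pred_masked = pred[:, num_keep:]                # predictions at the masked positions
    return F.mse_loss(pred_masked, target)
```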