Vision and Language

Why Vision + Language?

Vision-Language Joint Embedding

What is a vision-language joint embedding space?

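A joint embedding space maps images and text through separate encoders into a single shared vector space, where an image and a caption describing the same content land close together and unrelated pairs land far apart.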

Why vision-language joint embedding?

  1. Enrich training samples & labels for visual recognition models
    1. Samples: human-annotated images are scarce compared with the web-scale supply of image-text pairs
    2. Labels: free-form text lets the model generalize to unseen object categories
  2. Image-text / text-image retrieval
  3. Useful for downstream vision-language tasks

How is Vision-Language Joint Embedding Learned?

CLIP

Rather than needing handcrafted labels to train a good classifier for a given domain, we can leverage free-form text from the internet to learn a single model that works as a classifier across many domains.

Training

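CLIP trains an image encoder and a text encoder jointly on 400M web-scraped image-text pairs: within each batch, matching pairs are pulled together in the embedding space and mismatched combinations are pushed apart. A minimal sketch of this symmetric contrastive loss, assuming the encoders already produce batched features (encoder architectures omitted):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matching image-text pairs.

    image_features, text_features: (N, d) outputs of the two encoders;
    row i of each tensor comes from the same image-text pair, so every
    other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix; entry [i, j] compares image i with text j.
    logits = image_features @ text_features.t() / temperature

    # The correct "class" for image i is text i (its paired caption).
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```
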
Inference

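At inference time, CLIP classifies images zero-shot: each candidate class name is wrapped in a text prompt, embedded by the text encoder, and the image is assigned to the most similar prompt. A sketch using the Hugging Face port of CLIP (assumed installed); the image path and class names are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
class_names = ["dog", "cat", "airplane"]
# Prompt engineering: wrap each label in a natural-language template.
prompts = [f"a photo of a {c}" for c in class_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image  # (1, num_classes)
probs = logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```
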
Evaluation and Applications

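In the original paper, zero-shot CLIP matches the accuracy of a fully supervised ResNet-50 on ImageNet without using any of its training labels, and it transfers to dozens of other classification benchmarks. The joint embedding also directly supports image-text and text-image retrieval.
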
ImageBind

Bridging More Modalities
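
ImageBind extends the joint-embedding idea beyond two modalities: it learns a single embedding space covering images, text, audio, depth, thermal, and IMU data. Each non-image modality is trained contrastively against images alone, using naturally paired data (e.g. video with its audio track), yet modalities that never co-occur in training, such as audio and text, end up aligned through the shared image embedding, which enables cross-modal retrieval between them.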

Image Captioning

Encoder-decoder structure

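A compact sketch of this pattern, with illustrative sizes: a CNN backbone encodes the image into a grid of spatial features, and a Transformer decoder cross-attends to them while predicting the caption one token at a time (positional encodings omitted for brevity):

```python
import torch
import torch.nn as nn
import torchvision

class CaptioningModel(nn.Module):
    """Encoder-decoder captioner sketch: a CNN encodes the image into a
    grid of features; a Transformer decoder attends to them while
    generating the caption token by token. All sizes are illustrative."""

    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="DEFAULT")
        # Drop the average pool and classifier to keep the spatial grid.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)                  # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)      # (B, 49, 2048)
        memory = self.proj(feats)                     # (B, 49, d_model)
        tgt = self.embed(captions)                    # (B, T, d_model)
        # Causal mask so each position only sees earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(
            captions.size(1)).to(captions.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                         # next-token logits
```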

Video Captioning

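Video captioning follows the same encoder-decoder recipe, but the encoder must additionally aggregate information across frames (e.g. via temporal pooling or attention over per-frame features) before the decoder generates the sentence.
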
Scene graph

A scene graph is a structured representation of an image: objects become nodes and the relationships between them become edges.
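
For example, a toy scene graph for an image of a man riding a horse in a field, written as plain Python data (all labels illustrative):

```python
# Object nodes carry labels and attributes; relations are
# (subject_id, predicate, object_id) triples over the node ids.
scene_graph = {
    "objects": [
        {"id": 0, "label": "man",   "attributes": ["standing"]},
        {"id": 1, "label": "horse", "attributes": ["brown"]},
        {"id": 2, "label": "field", "attributes": ["grassy"]},
    ],
    "relations": [
        (0, "riding", 1),
        (1, "on", 2),
    ],
}
```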

Visual Question Answering

Answering open-ended questions about images, which requires an understanding of vision, language, and commonsense knowledge.

VQA as image + text → text

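Here an encoder consumes the image together with the question, and a language decoder generates the answer as free-form text, so no fixed answer vocabulary is needed.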

VQA as classification

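Alternatively, the answer can be predicted as one class out of a fixed vocabulary of frequent answers (commonly a few thousand). A minimal sketch, assuming pre-extracted image features and using element-wise product as an illustrative fusion choice:

```python
import torch
import torch.nn as nn

class VQAClassifier(nn.Module):
    """VQA-as-classification sketch: fuse image and question features and
    predict one answer out of a fixed vocabulary of frequent answers.
    Feature extractors and sizes are illustrative placeholders."""

    def __init__(self, vocab_size, num_answers=3000, d=1024):
        super().__init__()
        self.q_embed = nn.Embedding(vocab_size, 300)
        self.q_rnn = nn.LSTM(300, d, batch_first=True)
        self.img_proj = nn.Linear(2048, d)   # e.g. pooled CNN features
        self.classifier = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(), nn.Linear(d, num_answers)
        )

    def forward(self, image_feats, question_tokens):
        # Summarize the question with the LSTM's final hidden state.
        _, (h, _) = self.q_rnn(self.q_embed(question_tokens))
        q = h[-1]                            # (B, d) question summary
        v = self.img_proj(image_feats)       # (B, d) image summary
        fused = q * v                        # element-wise fusion
        return self.classifier(fused)        # logits over answer vocabulary
```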

Bottom-up and top-down attention

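In this scheme, "bottom-up" attention is an object detector (Faster R-CNN in the original work) that proposes a set of salient image regions and extracts a feature vector for each, while "top-down" attention uses the question to decide how much weight each region gets. A sketch of the top-down step, with assumed feature dimensions:

```python
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    """Top-down attention over bottom-up region features: a detector
    proposes K regions, and the question vector scores how relevant
    each region is. Dimensions are illustrative."""

    def __init__(self, d_region=2048, d_question=1024, d_hidden=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(d_region + d_question, d_hidden),
            nn.Tanh(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, regions, question):
        # regions: (B, K, d_region); question: (B, d_question)
        q = question.unsqueeze(1).expand(-1, regions.size(1), -1)
        logits = self.score(torch.cat([regions, q], dim=-1))  # (B, K, 1)
        weights = logits.softmax(dim=1)       # one weight per region
        return (weights * regions).sum(dim=1)  # attended feature (B, d_region)
```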

Visual Grounding

Given a natural-language phrase (a referring expression), locate the object(s) it refers to in the image.
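
One common recipe is grounding by ranking: embed each detected region proposal and the phrase into a shared space, then return the box of the best-scoring region. A minimal sketch, assuming the features are already projected into that shared space:

```python
import torch
import torch.nn.functional as F

def ground_phrase(region_feats, region_boxes, phrase_feat):
    """Grounding-by-ranking sketch.

    region_feats: (K, d) features for K region proposals,
    region_boxes: (K, 4) matching boxes, phrase_feat: (d,) phrase embedding.
    Returns the box whose region best matches the phrase.
    """
    # Cosine similarity between every region and the phrase.
    sims = F.normalize(region_feats, dim=-1) @ F.normalize(phrase_feat, dim=-1)
    return region_boxes[sims.argmax()]
```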