Vision and Language
Why Vision + Language?
- We live in a multi-modal world
- We learn from both vision and language modalities
- Allows AI to interact with human beings
- Language provides open-set and comprehensive semantic information for visual perception
Vision-language Joint Embedding
What is a vision-language joint embedding space?
Why vision-language joint embedding?
- Enrich training samples & labels for visual recognition models
  - Samples: manually annotated images are far more limited than freely available image-text pairs
  - Labels: generalize to unseen object categories
- Image-text / text-image retrieval
- Useful for downstream vision-language tasks
How is Vision-Language Joint Embedding Learned?
- Data: image-text pairs
- Joint embedding: Alignment!
CLIP
Rather than needing handcrafted labels to train a good classifier for each domain, we can leverage free-form text from the internet to learn a model that works as a good classifier across many domains.
Training
- CLIP is trained contrastively: for each image in a batch, it classifies which text caption in the batch is the matching one (and vice versa)
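A minimal sketch of this symmetric contrastive objective, assuming the image and text features have already been projected into the shared embedding space (the function name, dimensions, and temperature value here are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_features, text_features: [batch, dim] outputs of the two encoders,
    assumed already projected into the shared embedding space.
    """
    # L2-normalize so that dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix; entry (i, j) compares image i with text j
    logits = image_features @ text_features.t() / temperature

    # The matching pair for row i is column i
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random features stand in for real encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
loss = clip_contrastive_loss(img, txt)
```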
Inference
- Create new zero-shot classifiers at inference time:
  - compute the text embeddings for a set of classification labels (e.g., from prompts such as “a photo of a {label}”)
  - compute each label’s similarity to the image embedding and predict the most similar label
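A minimal zero-shot classification sketch along these lines; the features stand in for outputs of the trained CLIP encoders, and `zero_shot_classify` is an illustrative helper, not a library call:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_feature, label_text_features, labels):
    """Pick the label whose text embedding is most similar to the image embedding.

    image_feature: [dim]; label_text_features: [num_labels, dim].
    """
    image_feature = F.normalize(image_feature, dim=-1)
    label_text_features = F.normalize(label_text_features, dim=-1)
    sims = label_text_features @ image_feature      # cosine similarity per label
    probs = sims.softmax(dim=-1)                    # optional: turn scores into a distribution
    return labels[probs.argmax().item()], probs

labels = ["cat", "dog", "car"]
image_feat = torch.randn(512)                       # placeholder image-encoder output
text_feats = torch.randn(len(labels), 512)          # e.g. from prompts "a photo of a {label}"
pred, probs = zero_shot_classify(image_feat, text_feats, labels)
```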
Evaluation and Applications
- Calculate image-text similarities
- Image-text retrieval based on image-text similarities (see the ranking sketch after this list)
- Image classification
- Representation learning
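Retrieval then reduces to ranking by the same cosine similarities; a small sketch with placeholder features (the `retrieve` helper is illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve(query_features, gallery_features, k=5):
    """Rank gallery items by cosine similarity to each query.

    Works in both directions: text->image retrieval passes text features as
    queries and image features as the gallery, and vice versa.
    """
    q = F.normalize(query_features, dim=-1)
    g = F.normalize(gallery_features, dim=-1)
    sims = q @ g.t()                      # [num_queries, num_gallery]
    return sims.topk(k, dim=-1).indices   # indices of the top-k matches per query

# e.g. retrieve the 5 closest images for each caption
caption_feats = torch.randn(3, 512)
image_feats = torch.randn(100, 512)
topk = retrieve(caption_feats, image_feats, k=5)
```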
ImageBind
Bridging More Modalities
- One embedding space to bind all modalities
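A sketch of the binding idea, assuming image-paired data for the new modality and an image embedding space that is kept fixed; the encoder, dimensions, and temperature below are placeholders, not the actual ImageBind architecture:

```python
import torch
import torch.nn.functional as F

# Bind a new modality (audio here) to the image embedding space with an
# image-anchored contrastive loss. Names and sizes are illustrative.
audio_encoder = torch.nn.Linear(128, 512)   # stand-in for a real audio encoder

def bind_step(audio_inputs, paired_image_features, optimizer, temperature=0.07):
    """One training step aligning audio embeddings to (frozen) paired image embeddings."""
    a = F.normalize(audio_encoder(audio_inputs), dim=-1)
    v = F.normalize(paired_image_features.detach(), dim=-1)   # image space stays fixed
    logits = a @ v.t() / temperature
    targets = torch.arange(logits.shape[0])
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Because every modality is tied to images, modalities that were never paired
# directly (e.g. audio and text) become comparable through the shared space.
opt = torch.optim.Adam(audio_encoder.parameters(), lr=1e-4)
loss = bind_step(torch.randn(8, 128), torch.randn(8, 512), opt)
```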
Image Captioning
- Describe the content of an image or video with a natural language sentence
Encoder-decoder structure
Visual Encoders
Visual Decoders
Show and Tell
- Encode the image with a CNN-based model
- Decode a caption autoregressively with an RNN (LSTM) language model
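A minimal sketch in the spirit of Show and Tell (toy CNN, illustrative layer sizes, teacher-forced decoding), not the original architecture:

```python
import torch
import torch.nn as nn

class ShowAndTell(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Toy CNN encoder; the original used a pretrained classification CNN
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature is fed once, as the first "word" of the sequence
        img_feat = self.cnn(images).unsqueeze(1)              # [B, 1, embed_dim]
        words = self.word_embed(captions)                     # [B, T, embed_dim]
        inputs = torch.cat([img_feat, words], dim=1)          # [B, T+1, embed_dim]
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                               # next-word logits

model = ShowAndTell(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
```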
Bottom-up and Top-down attention
- Usually models operate on CNN features corresponding to a uniform grid
- What if we use features from object detections instead?
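A sketch of soft attention over detected-region features, where the decoder state supplies the top-down signal; the dimensions (e.g., 36 regions, 2048-d features) are illustrative:

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Soft attention over detected-region (bottom-up) features,
    weighted by the current decoder state (top-down signal)."""
    def __init__(self, region_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, attn_dim)
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, region_feats, decoder_state):
        # region_feats: [B, num_regions, region_dim] from an object detector
        # decoder_state: [B, hidden_dim] from the language LSTM
        e = torch.tanh(self.proj_region(region_feats)
                       + self.proj_hidden(decoder_state).unsqueeze(1))
        alpha = self.score(e).softmax(dim=1)            # attention weight per region
        return (alpha * region_feats).sum(dim=1)        # attended visual feature [B, region_dim]

attn = RegionAttention()
ctx = attn(torch.randn(2, 36, 2048), torch.randn(2, 512))
```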
Video Captioning
- Video to text (sequence to sequence)
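A minimal sequence-to-sequence sketch, assuming per-frame CNN features as input; the sizes and single-layer LSTMs are illustrative:

```python
import torch
import torch.nn as nn

class S2SVideoCaptioner(nn.Module):
    """Encode a sequence of frame features, then decode a sentence."""
    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: [B, num_frames, feat_dim] (e.g. CNN features per frame)
        _, state = self.encoder(frame_feats)          # summarize the frame sequence
        words = self.word_embed(captions)             # [B, T, hidden_dim]
        hidden, _ = self.decoder(words, state)        # decode conditioned on the video
        return self.out(hidden)                       # next-word logits

model = S2SVideoCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 16, 2048), torch.randint(0, 10000, (2, 12)))
```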
Scene graph
A scene graph is a structured representation of an image: objects as nodes, pairwise relationships as edges, and (optionally) attributes attached to the objects
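A toy example of what such a graph can look like (Visual Genome-style fields; the labels and boxes are made up for illustration):

```python
# One image's scene graph: objects as nodes, relationships as directed edges
scene_graph = {
    "objects": [
        {"id": 0, "label": "man",   "bbox": [52, 40, 180, 300],  "attributes": ["standing"]},
        {"id": 1, "label": "horse", "bbox": [200, 90, 420, 310], "attributes": ["brown"]},
        {"id": 2, "label": "field", "bbox": [0, 150, 640, 480],  "attributes": ["grassy"]},
    ],
    "relationships": [
        {"subject": 0, "predicate": "feeding",     "object": 1},   # man feeding horse
        {"subject": 1, "predicate": "standing in", "object": 2},   # horse standing in field
    ],
}
```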
Visual Question Answering
Answering open-ended questions about images, which requires an understanding of vision, language, and commonsense knowledge
VQA as image + text → text
VQA as classification
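A minimal sketch of the classification formulation, assuming precomputed image and question features and a fixed vocabulary of frequent answers; the element-wise-product fusion is just one common choice, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

class VQAClassifier(nn.Module):
    """Fuse an image feature and a question feature, then classify over answers."""
    def __init__(self, num_answers=3000, img_dim=2048, q_dim=512, hidden_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(hidden_dim, num_answers))

    def forward(self, image_feat, question_feat):
        # Element-wise product is a simple, common fusion of the two modalities
        fused = self.img_proj(image_feat) * self.q_proj(question_feat)
        return self.classifier(fused)        # logits over candidate answers

model = VQAClassifier()
logits = model(torch.randn(2, 2048), torch.randn(2, 512))
answer_idx = logits.argmax(dim=-1)           # index into the answer vocabulary
```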
Bottom-up and top-down attention
Visual Grounding
Locate the objects/regions in an image that a natural language expression refers to
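A minimal sketch, assuming region features from a detector and a phrase embedding projected into the same space; the `ground_phrase` helper is illustrative:

```python
import torch
import torch.nn.functional as F

def ground_phrase(phrase_feat, region_feats, region_boxes):
    """Pick the detected region that best matches a phrase embedding.

    phrase_feat: [dim] text embedding of the referring expression;
    region_feats: [num_regions, dim] region features in the same space;
    region_boxes: [num_regions, 4] boxes from the detector.
    """
    sims = F.normalize(region_feats, dim=-1) @ F.normalize(phrase_feat, dim=-1)
    best = sims.argmax().item()
    return region_boxes[best], sims

box, sims = ground_phrase(torch.randn(512), torch.randn(36, 512), torch.rand(36, 4))
```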