Videos
A video is a sequence of images
4D tensor: T x 3 x H x W
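A minimal sketch of this representation, using numpy as a stand-in for a framework tensor (the sizes T=16, H=W=112 match the training-clip example later in these notes):

```python
import numpy as np

# One video clip as a 4D tensor of shape (T, 3, H, W):
# T frames, 3 color channels, H x W pixels per frame.
T, C, H, W = 16, 3, 112, 112
video = np.zeros((T, C, H, W), dtype=np.float32)

print(video.shape)   # (16, 3, 112, 112)
print(video.nbytes)  # 16 * 3 * 112 * 112 * 4 bytes as float32
```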
Video Classification
Fixed set of categories, as in image classification
In video classification, we want to recognize actions
Problem: Videos are big!
Videos are ~30 frames per second
Size of uncompressed video (3 bytes per pixel):
- SD (640 x 480): ~1.5 GB per minute
- HD (1920 x 1080): ~10 GB per minute
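A back-of-the-envelope check of these numbers: raw RGB at 3 bytes per pixel and 30 fps.

```python
# Uncompressed video size: width * height * 3 bytes * fps * seconds.
def raw_bytes_per_minute(width, height, fps=30, bytes_per_pixel=3):
    return width * height * bytes_per_pixel * fps * 60

sd = raw_bytes_per_minute(640, 480)    # ~1.66e9 bytes  -> ~1.5 GiB per minute
hd = raw_bytes_per_minute(1920, 1080)  # ~1.12e10 bytes -> ~10.4 GiB per minute

print(sd / 2**30, hd / 2**30)
```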
Solution: train on short clips
- Low fps
- Low spatial resolution
- e.g. T=16, H=W=112 (3.2 seconds at 5 fps, 588 KB)
Testing
Run model on different clips, average predictions
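The test-time procedure above can be sketched as follows; the sliding-window clip sampler and the dummy scoring "model" are assumptions for illustration, not part of the notes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict_video(video, model, clip_len=16, stride=8):
    # Slide a fixed-length window over time, classify each clip,
    # then average the per-clip class probabilities.
    T = video.shape[0]
    probs = []
    for start in range(0, T - clip_len + 1, stride):
        clip = video[start:start + clip_len]
        probs.append(softmax(model(clip)))
    return np.mean(probs, axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(4,))               # 4 hypothetical action classes
model = lambda clip: clip.mean() * W    # dummy stand-in for a trained network
video = rng.random((64, 3, 32, 32))     # 64-frame video

p = predict_video(video, model)
print(p.shape)  # (4,) probabilities summing to 1
```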
Single-Frame CNN
Idea
Train a normal 2D CNN to classify each video frame independently!
- Average predicted probs at test-time
Often a very strong baseline for video classification
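A sketch of the single-frame baseline at test time; the linear per-frame "classifier" here is a hypothetical stand-in for a trained 2D CNN:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_classes = 5
Wc = rng.normal(size=(3 * 16 * 16, num_classes)) * 0.01  # stand-in classifier weights

def classify_frame(frame):
    # Placeholder for a 2D CNN applied to a single (3, H, W) frame.
    return softmax(frame.reshape(-1) @ Wc)

video = rng.random((10, 3, 16, 16))                         # 10 frames
frame_probs = np.stack([classify_frame(f) for f in video])  # (10, 5)
video_probs = frame_probs.mean(axis=0)                      # average at test time

print(video_probs.shape)  # (5,)
```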
Late Fusion
Intuition
Get high-level appearance of each frame and combine them
With FC Layers
With Pooling
Fewer learnable parameters than the FC variant
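A sketch of the pooling variant of late fusion; the fixed linear projection standing in for the 2D CNN backbone is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D, num_classes = 64, 10
frame_dim = 3 * 8 * 8
P = rng.normal(size=(frame_dim, D)) / np.sqrt(frame_dim)  # stand-in 2D CNN
W_cls = rng.normal(size=(D, num_classes))                 # classifier weights

def cnn_features(frame):
    # Placeholder for a 2D CNN backbone producing a D-dim appearance feature.
    return frame.reshape(-1) @ P

video = rng.random((16, 3, 8, 8))
feats = np.stack([cnn_features(f) for f in video])  # (T, D): one feature per frame
pooled = feats.mean(axis=0)                         # fuse over time by average pooling
scores = pooled @ W_cls                             # (num_classes,)

print(scores.shape)
```

The FC variant would instead flatten all T features into one (T * D)-dim vector before the classifier, which needs T times more classifier weights than pooling does.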
Early Fusion
Intuition
Compare frames with the very first conv layer
- After that, a normal 2D CNN
First 2D convolution collapses all temporal information
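The key reshape is folding time into channels, so the first 2D conv sees all 3T input channels at once:

```python
import numpy as np

T, C, H, W = 16, 3, 32, 32
video = np.random.default_rng(0).random((T, C, H, W))

# Early fusion: stack the T frames along the channel axis.
# The first conv layer then has kernels of shape (C_out, T*C, k, k),
# so all temporal information is collapsed in that single layer.
stacked = video.reshape(T * C, H, W)

print(stacked.shape)  # (48, 32, 32)
```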
3D CNN (Slow Fusion)
Intuition
Use 3D versions of convolution and pooling to slowly fuse temporal information over the course of the network
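A naive single-channel 3D convolution (written from scratch in numpy as a sketch; real networks use an optimized framework op), showing that the kernel slides in time as well as space:

```python
import numpy as np

def conv3d_single(x, w):
    # Valid 3D convolution (cross-correlation, as in deep learning)
    # of one (T, H, W) volume with one (kt, kh, kw) kernel.
    T, H, W = x.shape
    kt, kh, kw = w.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = (x[t:t+kt, i:i+kh, j:j+kw] * w).sum()
    return out

x = np.random.default_rng(0).random((8, 10, 10))
w = np.ones((3, 3, 3)) / 27.0   # a 3x3x3 averaging kernel
y = conv3d_single(x, w)

print(y.shape)  # (6, 8, 8): shrinks in time and space
```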
Early Fusion vs Late Fusion vs 3D CNN
Late Fusion
- Build slowly in space
- All-at-once in time at end
Early Fusion
- Build slowly in space
- All-at-once in time at start
3D CNN
- Build slowly in space
- Build slowly in time
Accuracy
2D Conv vs 3D Conv
C3D: The VGG of 3D CNNs
3D CNN that uses all 3x3x3 conv and 2x2x2 pooling
- Except Pool1 with 1x2x2
Computational cost is high:
- AlexNet: 0.7 GFLOP
- VGG-16: 13.6 GFLOP
- C3D: 39.5 GFLOP
Measuring Motion: Optical Flow
Optical flow gives a displacement field F between images
- Tells where each pixel will move in the next frame
- Vertical and horizontal flow
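A toy check of the flow convention: if every pixel moves 2 columns to the right between frames, the displacement field is (dx, dy) = (2, 0), and frame2 at the displaced position matches frame1. (In practice the flow would come from a classical estimator or a network, not be known exactly; this synthetic shift is an assumption for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
frame1 = rng.random((6, 8))
frame2 = np.zeros_like(frame1)
frame2[:, 2:] = frame1[:, :-2]   # everything shifts right by 2 pixels

# Flow field F has two channels (horizontal dx, vertical dy); here it is
# constant: dx = 2, dy = 0 for every pixel.
dx, dy = 2, 0
ok = np.allclose(frame2[:, dx:], frame1[:, :-dx])

print(ok)  # True
```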
Two-Stream Networks
Separating Motion and Appearance
- Top stream processes appearance of input video
- Lower stream is the motion stream
- Processing only motion
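One common way to combine the two streams is to average their class probabilities at test time (late fusion of the streams); the dummy score vectors here are placeholders for the two networks' outputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_classes = 6
rgb_scores = rng.normal(size=num_classes)    # appearance stream (RGB) output
flow_scores = rng.normal(size=num_classes)   # motion stream (optical flow) output

# Fuse the two streams by averaging their predicted distributions.
fused = 0.5 * (softmax(rgb_scores) + softmax(flow_scores))

print(fused.shape)  # (6,)
```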
Performance
Modeling long-term temporal structure
We can handle sequences with Recurrent Neural Networks!
Inside a CNN: each value is a function of a fixed temporal window (local)
Inside an RNN: each vector is a function of all previous vectors (global)
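The recurrence below makes the contrast concrete: each hidden state h_t depends on all earlier frame features through h_{t-1}, not just a fixed window. This is a vanilla-RNN sketch over precomputed per-frame features (the feature and weight shapes are assumptions):

```python
import numpy as np

def rnn_over_time(feats, Wx, Wh, b):
    # h_t = tanh(x_t Wx + h_{t-1} Wh + b): computed sequentially over time.
    h = np.zeros(Wh.shape[0])
    hs = []
    for x_t in feats:
        h = np.tanh(x_t @ Wx + h @ Wh + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
T, D, Hdim = 16, 32, 8
feats = rng.random((T, D))                 # one feature vector per frame
hs = rnn_over_time(feats,
                   rng.normal(size=(D, Hdim)) * 0.1,
                   rng.normal(size=(Hdim, Hdim)) * 0.1,
                   np.zeros(Hdim))

print(hs.shape)  # (16, 8): one hidden state per timestep
```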
Recurrent Convolutional Network
Problem: RNNs process the sequence one step at a time, so they cannot be parallelized over time
Self-Attention
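Self-attention over the temporal axis gives each timestep access to every other timestep while all outputs are computed in parallel (one matrix product, no sequential loop). A single-head sketch over per-frame features, with assumed dimensions:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (T, D) per-frame features. Every timestep attends to all timesteps.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = Q @ K.T / np.sqrt(K.shape[1])          # (T, T) attention scores
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)       # softmax over the time axis
    return A @ V                               # (T, D) fused features

rng = np.random.default_rng(0)
T, D = 16, 32
X = rng.random((T, D))
Wq = rng.normal(size=(D, D)) * 0.1
Wk = rng.normal(size=(D, D)) * 0.1
Wv = rng.normal(size=(D, D)) * 0.1

Y = self_attention(X, Wq, Wk, Wv)
print(Y.shape)  # (16, 32)
```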
Inflating 2D Networks to 3D (I3D)
Idea
Take a 2D CNN architecture
- Replace each 2D conv/pool layer with a 3D version
Transfer Weights
Can use weights of the 2D conv to initialize the 3D conv
- Copy the 2D kernel Kt times along the time dimension and divide by Kt
- This gives the same result as the 2D conv given a "constant" video input
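This equivalence can be verified numerically: inflate a 2D kernel by stacking it Kt times and rescaling by 1/Kt, then convolve a video made of one repeated frame. The naive single-channel convolutions below are sketches written for the check, not framework ops:

```python
import numpy as np

def conv2d_valid(img, k2d):
    H, W = img.shape
    kh, kw = k2d.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i+kh, j:j+kw] * k2d).sum()
    return out

def conv3d_valid(vid, k3d):
    T, H, W = vid.shape
    kt, kh, kw = k3d.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = (vid[t:t+kt, i:i+kh, j:j+kw] * k3d).sum()
    return out

rng = np.random.default_rng(0)
k2d = rng.normal(size=(3, 3))
Kt = 3
k3d = np.repeat(k2d[None], Kt, axis=0) / Kt   # inflate: copy Kt times, divide by Kt

frame = rng.random((8, 8))
video = np.repeat(frame[None], 5, axis=0)     # "constant" video: same frame repeated

same = np.allclose(conv3d_valid(video, k3d)[0], conv2d_valid(frame, k2d))
print(same)  # True
```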