Object Detection
Types of Computer Vision Tasks
Classification
Object Detection & Instance Segmentation
Object Detection
Input: Single RGB Image
Output: A set of detected objects
- For each object, predict:
- Category label (from a fixed, known set of categories)
- Bounding box (four numbers: x, y, width, height)
Challenges
Multiple outputs: Need to output a variable number of objects per image
Multiple types of output: Need to predict category label and bounding box
Large images: Classification works at 224x224
- Need higher resolution for detection
Detecting a single object
Classification + Localization
Two branches
- Multiple loss functions
- Combine them (e.g. as a weighted sum), since gradient descent needs a single scalar loss
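A minimal sketch of how the two branch losses might be combined into one scalar. The choice of softmax cross-entropy for the label, squared L2 distance for the box, and the `box_weight` parameter are illustrative assumptions, not something these notes specify.

```python
import math

def softmax_cross_entropy(scores, label):
    """Cross-entropy of softmax(scores) against an integer class label."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

def l2_box_loss(pred_box, true_box):
    """Squared L2 distance between two (x, y, width, height) boxes."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))

def multitask_loss(scores, label, pred_box, true_box, box_weight=1.0):
    """Weighted sum of both branch losses: one scalar for gradient descent.

    box_weight is a hypothetical hyperparameter that trades off the two terms.
    """
    cls_loss = softmax_cross_entropy(scores, label)
    box_loss = l2_box_loss(pred_box, true_box)
    return cls_loss + box_weight * box_loss
```

The weight matters in practice because the two losses can have very different magnitudes; it usually has to be tuned.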
Detecting multiple objects
Need different numbers of outputs per image
Sliding Window
Apply a CNN to many different crops of the image
- CNN classifies each crop as object or background
Very computationally expensive
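To see why this is so expensive, the sketch below counts how many distinct axis-aligned boxes fit in an image. It assumes boxes with integer pixel sizes and positions; the function names are illustrative.

```python
def count_windows(image_h, image_w):
    """Count every box position for every integer box size in the image."""
    total = 0
    for h in range(1, image_h + 1):
        for w in range(1, image_w + 1):
            # A box of size h x w can be placed at this many positions.
            total += (image_h - h + 1) * (image_w - w + 1)
    return total

def count_windows_closed_form(image_h, image_w):
    """Same count using the closed form H(H+1)/2 * W(W+1)/2."""
    return (image_h * (image_h + 1) // 2) * (image_w * (image_w + 1) // 2)
```

For a 600x800 image the closed form gives roughly 58 billion boxes, far too many to run a CNN on each.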
Region Proposals: Selective Search
- Find a small set of boxes that are likely to cover all the objects
- “blobby” image regions
- Relatively fast to run
- Gives ~2000 region proposals in a few seconds on CPU
R-CNN: Region-Based CNN
Steps
- Run region proposals
- Get regions of interest
- For each region proposal
- Warp region to a fixed size
- Run independently through CNN
- Classify each region
- Bounding box regression
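The bounding box regression step predicts a transform that adjusts each proposal. The sketch below uses one common parameterization (the one described for R-CNN), assuming boxes are given as (center x, center y, width, height); whether these notes intend exactly this form is an assumption.

```python
import math

def apply_box_transform(proposal, transform):
    """Apply a predicted transform (tx, ty, tw, th) to a proposal box."""
    px, py, pw, ph = proposal  # center x, center y, width, height
    tx, ty, tw, th = transform
    # Shift the center relative to the proposal size; scale size in log space.
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))

def compute_box_transform(proposal, target):
    """Inverse: the transform that maps the proposal onto the target box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))
```

With this form, a transform of all zeros leaves the proposal unchanged, and `compute_box_transform` gives the regression target used during training.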
Test time
Input: Single RGB image
- Run region proposal method to compute ~2000 region proposals
- Resize each region to 224x224 and run independently through CNN to predict class scores and bounding box transform
- Use scores to select a subset of region proposals to output
- Compare with ground-truth boxes
Comparing Boxes: Intersection over Union
How can we compare our prediction to the ground-truth box?
- IoU = area of intersection / area of union
IoU > 0.5 is “decent”
IoU > 0.7 is “pretty good”
IoU > 0.9 is “almost perfect”
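A minimal sketch of computing IoU for two boxes. It assumes the (x, y, width, height) format from earlier in these notes, with (x, y) taken to be the top-left corner.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give 1.0, disjoint boxes give 0.0, and two equal boxes offset by half their width give 1/3.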
Overlapping Boxes: Non-Max Suppression (NMS)
Problem: Object detectors often output many overlapping detections
Solution: Post-process raw detections using NMS
1. Select the next highest-scoring box
2. Eliminate lower-scoring boxes whose IoU with it is > threshold (e.g. 0.7)
3. If any boxes remain, go to step 1
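The NMS loop above can be sketched as follows. This is a simple greedy version under the assumption that boxes are (x, y, width, height) with a top-left corner; the helper `iou` is defined here only to keep the example self-contained.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7):
    """Return indices of the boxes kept by greedy non-max suppression."""
    # Candidate indices, sorted from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # select the next highest-scoring box
        keep.append(best)
        # Drop remaining boxes that overlap the selected box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, given two heavily overlapping boxes and one distant box, only the higher-scoring of the overlapping pair and the distant box survive.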
Fast R-CNN
Problem with Slow R-CNN
Very slow! Need to do ~2k forward passes for each image!
Swap the order
Faster because computation is shared across region proposals
- Swap the order of the CNN and cropping + warping: run the CNN once on the whole image, then crop and warp features for each proposal
What does it mean to crop features?
Region of Interest Pooling
- Project proposal onto features
- “Snap” to grid cells
- Max pool within each subregion
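The three steps above can be sketched for a single-channel feature map stored as a 2D list. The (x, y, width, height) box format with a top-left corner, the `stride` parameter (image pixels per feature cell), and the way bins are split are all simplifying assumptions for illustration.

```python
def roi_pool(features, box, stride, out_size=2):
    """Max-pool a single-channel feature map inside one proposal box."""
    rows, cols = len(features), len(features[0])
    x, y, w, h = box
    # Project the box onto the feature grid and snap to integer cell edges.
    c0 = min(max(int(round(x / stride)), 0), cols - 1)
    r0 = min(max(int(round(y / stride)), 0), rows - 1)
    c1 = min(max(int(round((x + w) / stride)), c0 + 1), cols)
    r1 = min(max(int(round((y + h) / stride)), r0 + 1), rows)

    def bin_edges(start, end, index):
        # Split [start, end) into out_size roughly equal, non-empty bins.
        size = end - start
        lo = start + (index * size) // out_size
        hi = start + ((index + 1) * size + out_size - 1) // out_size
        return lo, hi

    pooled = []
    for i in range(out_size):
        ra, rb = bin_edges(r0, r1, i)
        row = []
        for j in range(out_size):
            ca, cb = bin_edges(c0, c1, j)
            # Max pool within this subregion.
            row.append(max(features[r][c]
                           for r in range(ra, rb) for c in range(ca, cb)))
        pooled.append(row)
    return pooled
```

Whatever the proposal size, the output is always `out_size` x `out_size`, which is what lets the per-region network take a fixed-size input.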
Problem
Misalignment when we snap to grid cells
RoI Align
No “snapping”
- Sample features at real-valued positions using bilinear interpolation
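A simplified sketch of the idea: the projected box keeps its real-valued coordinates, and each output bin is filled by bilinear interpolation. It assumes a single-channel feature map where `features[r][c]` sits at continuous coordinates (c, r), boxes as (x, y, width, height) with a top-left corner, and only one sample per bin at the bin center; practical implementations typically average several samples per bin.

```python
import math

def bilinear(features, x, y):
    """Bilinearly interpolate the feature map at a real-valued (x, y)."""
    h, w = len(features), len(features[0])
    # Clamp the sample point to the valid coordinate range.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = features[y0][x0] * (1 - dx) + features[y0][x1] * dx
    bottom = features[y1][x0] * (1 - dx) + features[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(features, box, stride, out_size=2):
    """Sample one interpolated value at the center of each output bin."""
    # Project onto the feature grid without rounding.
    x, y, w, h = (v / stride for v in box)
    bin_w, bin_h = w / out_size, h / out_size
    return [[bilinear(features, x + (j + 0.5) * bin_w, y + (i + 0.5) * bin_h)
             for j in range(out_size)]
            for i in range(out_size)]
```

Because nothing is rounded, small shifts of the proposal produce small changes in the sampled features, which avoids the misalignment caused by snapping.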
Speed Improvement
Computing region proposals now dominates the runtime, so we should predict the proposals with a CNN too!
Faster R-CNN
Insert Region Proposal Network to predict proposals from features
Otherwise same as Fast R-CNN
- Crop features for each proposal
- Classify each one
Region Proposal Network
- Run backbone CNN to get features aligned to input image
- Imagine an anchor box of fixed size at each point in the feature map
- At each point, predict whether the corresponding anchor contains an object
- For positive boxes, also predict a box transform to regress from anchor box to object box
We want to use anchor boxes of different sizes and shapes
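A minimal sketch of generating anchors of several sizes and shapes at every feature-map position. The specific `scales` and `ratios` values, the `stride` parameter, and the (center x, center y, width, height) output format are illustrative assumptions.

```python
import math

def make_anchors(feat_h, feat_w, stride,
                 scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (center x, center y, width, height) in image pixels."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature cell in image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = width / height
                    # Keep the area at scale**2 while changing the shape.
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors
```

With K anchors per position (here K = 9), the RPN predicts K object scores and K box transforms at each point of the feature map.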
Jointly train with 4 losses
- RPN Classification: Anchor box is object / not object
- RPN Regression: predict transform from anchor box to proposal box
- Object Classification: Classify proposals as background / object
- Object Regression: Predict transform from proposal box to object box
SPEEDDDDD
Two-stage Object Detector
First stage: Run once per image
- Backbone network
- Region proposal network
Second stage: Run once per region
- Crop features: RoI pool / align
- Predict object class
- Predict bbox offset
Do we need the second stage?
Single-Stage Object Detection
Two-stage: more accurate but slower
Single-stage: faster but less accurate