This is the third story in the R-CNN series. You may learn more about R-CNN and Fast R-CNN from the earlier stories. Faster R-CNN (Region-based Convolutional Neural Network) is formed by two networks: a region proposal network (RPN) and an object detection network.
This story will discuss Faster R-CNN (Ren et al., 2015), and the following sections are:
- The architecture of Faster R-CNN
- Region Proposal Network (RPN)
- Model Training
Given an image, the RPN outputs a set of bounding boxes. Those bounding boxes are then passed to the object detection network, which classifies the objects.
Unlike R-CNN and Fast R-CNN, region proposals are not generated by selective search but are extracted directly from the feature maps. The RPN (Region Proposal Network) is introduced to tackle this challenge.
Ren et al. defined these reference bounding boxes as anchors. The default anchors use three scales (128×128, 256×256, and 512×512) and three aspect ratios (1:1, 1:2, and 2:1), so we get 9 anchors in total. The scales, ratios, and number of anchors depend on your objective. For example, recognizing faces in images may not need very large scales most of the time.
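The scale/ratio combinations above can be sketched as follows. This is a minimal illustration (not the paper's released code); a ratio r is interpreted as height:width while keeping the anchor area equal to scale²:

```python
import numpy as np

def generate_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Generate anchor (width, height) pairs for every scale/ratio combination.

    Each anchor keeps an area of roughly scale**2 while its aspect
    ratio (height:width) is adjusted to r.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)   # widen the box for small r
            h = s * np.sqrt(r)   # heighten the box for large r
            anchors.append((w, h))
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # (9, 2): 3 scales x 3 ratios = 9 anchors
```

Note that every anchor preserves its scale's area: for the 1:1 ratio the anchor is exactly scale × scale, while the 1:2 and 2:1 anchors stretch one side and shrink the other.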
Ren et al. introduced the anchor concept for the RPN. Each anchor is centered at the sliding window, and the design lets multiple boxes share the same center. The outputs of this layer are box coordinates and the probability of being an object or not. In other words, there are two sibling sub-networks on top of the anchors:
- The reg layer predicts 4 coordinates per anchor (encoding the box center and size).
- The cls layer predicts 2 scores per anchor, estimating whether it contains an object or not.
The number of anchors per sliding-window position (i.e. k) is configurable. The scales and aspect ratios are configurable as well, with 3 scales and 3 aspect ratios as the default values, so the maximum number of possible proposals per position is k = 9 anchors.
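The output sizes follow directly from k. As a rough illustration (the feature-map size below is a hypothetical example, not a value from the paper):

```python
import numpy as np

k = 9          # anchors per sliding-window position (3 scales x 3 ratios)
H, W = 38, 50  # hypothetical conv feature-map size (e.g. ~600x800 input / stride 16)

# Per position, the cls layer outputs 2k scores (object vs. not object)
# and the reg layer outputs 4k box coordinates.
cls_scores = np.zeros((H, W, 2 * k))
reg_coords = np.zeros((H, W, 4 * k))

total_anchors = H * W * k
print(total_anchors)  # 17100 candidate anchors over this feature map
```

This is why the RPN can cover many object sizes cheaply: the anchors multiply the per-position outputs, not the convolutional computation.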
Binary class labels are prepared for training the
cls layer. An anchor is assigned a positive label if either condition is fulfilled:
- It has the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or
- Its IoU overlap with some ground-truth box is higher than 0.7.
Negative label condition:
- Its IoU overlap with every ground-truth box is lower than 0.3.
What about the remaining anchors (IoU between 0.3 and 0.7 and not the highest overlap)? They are simply excluded from the training objective so that they do not harm training.
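The labeling rules above can be sketched in a few lines of numpy. This is a simplified illustration of the assignment logic (boxes are (x1, y1, x2, y2); 1 = positive, 0 = negative, -1 = ignored):

```python
import numpy as np

def iou(boxes, gt):
    """IoU matrix between N anchors and M ground-truth boxes."""
    x1 = np.maximum(boxes[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Assign 1 (positive), 0 (negative), or -1 (ignored) to each anchor."""
    overlaps = iou(anchors, gt_boxes)        # shape (N, M)
    max_per_anchor = overlaps.max(axis=1)
    labels = np.full(len(anchors), -1)       # default: excluded from training
    labels[max_per_anchor < neg_thresh] = 0  # clear negatives
    labels[max_per_anchor >= pos_thresh] = 1 # high-overlap positives
    # The anchor with the highest IoU for each ground-truth box is also
    # positive, even if its overlap never reaches 0.7.
    labels[overlaps.argmax(axis=0)] = 1
    return labels
```

The highest-IoU rule guarantees every ground-truth box gets at least one positive anchor; without it, a small or oddly shaped object might never cross the 0.7 threshold.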
Combining RPN and Fast R-CNN Object Detection
Ren et al. introduced 3 ways for training sharing network layers.
Alternating training: Train the RPN, then use its region proposals to train Fast R-CNN, and iterate. Ren et al. applied this method in their paper and experiments.
Approximate joint training: Train the RPN and Fast R-CNN as a single network, updating the shared layers by backward propagation. Ren et al. found that it produces results close to alternating training while reducing training time by about 25%-50%. It is implemented in the released Python code.
Non-approximate joint training: Approximate joint training ignores the gradients with respect to the proposal box coordinates. A non-approximate solution would need a Region-of-Interest pooling layer that is differentiable with respect to the box coordinates (an RoI warping layer).
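The alternating training scheme described above can be outlined in code. This is only a structural sketch; the train_rpn / train_fast_rcnn functions and the _Net class are hypothetical placeholders for real training loops, not the paper's released implementation:

```python
class _Net:
    """Hypothetical stand-in for a trained network (illustration only)."""
    def __init__(self, conv_layers=None):
        # If no shared backbone is given, pretend we start from ImageNet weights.
        self.conv_layers = conv_layers if conv_layers is not None else "imagenet_conv"

    def propose(self, images):
        # Placeholder: a real RPN would emit scored boxes per image.
        return [["box"] for _ in images]

def train_rpn(image_data, shared_conv=None):
    return _Net(shared_conv)

def train_fast_rcnn(image_data, proposals, shared_conv=None):
    return _Net(shared_conv)

def alternating_training(image_data):
    # Step 1: train the RPN from an ImageNet-pretrained model.
    rpn = train_rpn(image_data)
    # Step 2: train Fast R-CNN on the RPN's proposals (no sharing yet).
    detector = train_fast_rcnn(image_data, proposals=rpn.propose(image_data))
    # Step 3: re-train the RPN, now sharing the detector's conv layers.
    rpn = train_rpn(image_data, shared_conv=detector.conv_layers)
    # Step 4: fine-tune the detector's unique layers on the shared backbone.
    detector = train_fast_rcnn(image_data,
                               proposals=rpn.propose(image_data),
                               shared_conv=detector.conv_layers)
    return rpn, detector
```

After step 4, both networks share a single convolutional backbone, which is the key to the speed advantage discussed below.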
The Region Proposal Network (RPN) delivers better results with faster performance.
From the figures below, it can be seen that 3 scales with 3 aspect ratios deliver a better result.
As mentioned, 0.7 is used as the threshold to decide whether an anchor is assigned a positive label for the
cls layer. The following Recall-vs-IoU overlap ratio figure shows that 0.7 is the most balanced point.
- Comparing Faster R-CNN and Fast R-CNN, the latter uses selective search to generate region proposals, which leads to slow performance.
- The trick to improving speed is sharing the network layers between the RPN and object detection phases.
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. You can reach me via my Medium blog, LinkedIn, or GitHub.
Ren S., He K., Girshick R. and Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. 2015.
Faster R-CNN in Python