04 All for nothing

About Faster R-CNN process


written by hyojung chang


0. What is R-CNN? (Evolution to Fast R-CNN)

R-CNN is defined as "Region Proposal + Convolutional Neural Network (= CNN)".

The process of performing R-CNN is as follows :

1) Take an image as input

2) Extract about 2,000 region proposals where objects may exist from the image (using selective search)

3) Crop each proposed region from the image

4) Warp the cropped regions to the same size, then extract features using a CNN

5) Perform classification for each region proposal feature

Since R-CNN requires as many CNN computations as there are region proposals per image, its inference speed is very slow and training has to go through several complex stages. Fast R-CNN therefore performs the cropping at the feature-map level rather than the image level, reducing the 2,000 CNN operations per image to a single one.

In addition, R-CNN warped the differently sized regions to a common size, but warping is not a good approach because information is lost when the size is reduced and the aspect ratio is changed. Therefore, a method is needed to extract a fixed-length feature, without loss of information, from regions of various sizes on the feature map. Spatial Pyramid Pooling (= SPP) divides an image into several areas and then applies BoW (a method of extracting features of a fixed size from inputs of various sizes) to each area, so that some local information is preserved. Fast R-CNN uses only a single-level pyramid of the SPP layer, which is called RoI pooling.
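As a rough illustration, single-level RoI pooling can be sketched with adaptive max pooling over each cropped feature region. The minimal PyTorch example below is an approximation for illustration only (the feature map and RoI coordinates are made up):

```python
import torch
import torch.nn.functional as F

def roi_pool_single_level(feature_map, rois, output_size=(7, 7)):
    """Minimal single-level RoI pooling: crop each RoI from the
    feature map and max-pool it to a fixed output size."""
    pooled = []
    for x1, y1, x2, y2 in rois:  # RoI coordinates in feature-map cells
        region = feature_map[:, :, y1:y2, x1:x2]           # crop (1, C, h, w)
        pooled.append(F.adaptive_max_pool2d(region, output_size))
    return torch.cat(pooled, dim=0)                         # (num_rois, C, 7, 7)

# toy example: one 512-channel feature map and two RoIs of different sizes
fmap = torch.randn(1, 512, 38, 50)
rois = [(0, 0, 16, 12), (10, 5, 40, 30)]
print(roi_pool_single_level(fmap, rois).shape)  # torch.Size([2, 512, 7, 7])
```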


1. Overview of Faster R-CNN

...

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with ‘attention’ [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

...omitted below

Reference : <Link>

Faster R-CNN is composed of two networks: a Region Proposal Network and a detector that detects objects from the proposed regions. The Region Proposal Network (= RPN) is the core idea of Faster R-CNN. Faster R-CNN inherits the Fast R-CNN structure, eliminates selective search, and computes RoIs through the RPN. The RPN improves accuracy and running time, and avoids generating an excessive number of proposal boxes, because it reduces cost by sharing computation on the convolutional features. The RPN and Fast R-CNN are merged into a single network by sharing their convolutional features. This combination gives Faster R-CNN leading accuracy, but it also makes the architecture a two-stage network, which limits its processing speed.


Simply put, Faster R-CNN first extracts the feature map and then passes it to the RPN to compute the RoIs. RoI pooling is performed with the obtained RoIs, and classification is then performed for object detection.
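For reference, torchvision ships a pre-built Faster R-CNN; the short usage sketch below only illustrates the input/output format (the image is random data, and depending on your torchvision version you may need pretrained=True instead of weights="DEFAULT"):

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 + FPN backbone, pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 600, 800)        # dummy RGB image with values in [0, 1]
with torch.no_grad():
    outputs = model([image])           # one dict per input image

print(outputs[0]["boxes"].shape)       # (num_detections, 4) predicted boxes
print(outputs[0]["labels"][:5])        # predicted class labels
print(outputs[0]["scores"][:5])        # confidence scores
```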


2. Region Proposal Network(= RPN)

The RPN takes an image of any size as input and outputs a set of rectangular object proposals (bounding boxes), each with an objectness score. Specifically, the RPN takes the image feature map of the fifth convolutional layer (= conv5) as input and applies a 3x3 sliding window on the feature map. The intermediate layer then feeds into two different branches, one for the objectness score (determining whether the region contains an object or not) and the other for regression (determining how the bounding box should change to become more similar to the ground truth).

The process of performing RPN is as follows :

1) Receive the feature map extracted via the CNN as input. At this point, the size of the feature map is H x W x C (height, width, and number of channels).

2) Perform a 3x3 convolution on the feature map with 256 or 512 channels. This corresponds to the intermediate layer. Set padding to 1 so that H x W is preserved. The result of this intermediate layer is a second feature map of size H x W x 256 or H x W x 512.

3) Feed the second feature map into two branches to compute the classification and bounding-box regression predictions. It is important to note that these predictions are computed with 1x1 convolutions rather than fully connected layers, which gives the network the characteristics of a Fully Convolutional Network (FCN). This ensures it works regardless of the size of the input image; refer to a post on Fully Convolutional Networks for more information. (A code sketch of this head appears after this list.)

4) To perform the classification, apply a 1x1 convolution with as many channels as 2 (object vs. not object) x 9 (number of anchors). As a result, we get a feature map of size H x W x 18. Each position on the H x W grid represents a coordinate on the feature map, and the 18 channels beneath it contain the predicted values for whether or not each of the k anchor boxes at that position is an object. In other words, a single 1x1 convolution makes predictions for all anchors at all H x W positions. We then reshape these values appropriately and apply Softmax to obtain the probability that each anchor contains an object.

(Note) Why is the number of anchors 9? 3 scales (128, 256, 512) x 3 aspect ratios (2:1, 1:1, 1:2) = 9 combinations

5) Second, apply a 1x1 convolution with as many channels as 4 x 9 to obtain the bounding-box regression predictions. Since this is a regression, the resulting values are used as they are.

6) Calculate the RoIs from the values obtained above. First, sort the anchors by the object probability obtained from the classification branch and select the top K anchors. Next, apply the bounding-box regression to each of the K anchors and then apply Non-Maximum Suppression (NMS) to obtain the RoIs.
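As referenced in step 3), here is a minimal PyTorch sketch of such an RPN head, with the layer sizes taken from the text (this is an illustration, not the paper's reference implementation):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 intermediate conv followed by two 1x1 branches:
    objectness scores (2 per anchor) and box deltas (4 per anchor)."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)  # step 2)
        self.cls = nn.Conv2d(512, 2 * num_anchors, kernel_size=1)          # step 4)
        self.reg = nn.Conv2d(512, 4 * num_anchors, kernel_size=1)          # step 5)

    def forward(self, feature_map):                  # (N, C, H, W)
        x = torch.relu(self.conv(feature_map))       # (N, 512, H, W)
        return self.cls(x), self.reg(x)              # (N, 18, H, W), (N, 36, H, W)

fmap = torch.randn(1, 512, 38, 50)                   # e.g. a conv5 feature map
scores, deltas = RPNHead()(fmap)
print(scores.shape, deltas.shape)
```

Step 6) then reshapes the scores into per-anchor probabilities, applies the predicted deltas to the anchors, and filters the resulting boxes, e.g. with torchvision.ops.nms.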

To explain the classification layer and regression layer in more detail:

  • Classification layer

All anchors are classified into foreground and background. An anchor is labeled foreground if it has the highest IoU with a ground-truth box, or if its IoU with any ground-truth box is 0.7 or higher; it is labeled background if its IoU is lower than 0.3 for all ground-truth boxes.

Anchors that are neither positive nor negative do not contribute to the training objective.

(Note) A single ground-truth box may assign positive labels to multiple anchors.

IoU : Intersection over Union

foreground : positive anchor(= positive label)

background : non-positive anchor(= negative label)
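A simplified sketch of this labeling rule, using torchvision's box_iou (the anchor and ground-truth boxes below are made-up examples; anchors between the two thresholds stay at -1, i.e. ignored during training):

```python
import torch
from torchvision.ops import box_iou

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (foreground), 0 (background), or -1 (ignored) per anchor."""
    iou = box_iou(anchors, gt_boxes)              # (num_anchors, num_gt)
    max_iou, _ = iou.max(dim=1)                   # best overlap per anchor

    labels = torch.full((len(anchors),), -1, dtype=torch.long)
    labels[max_iou < neg_thresh] = 0              # background
    labels[max_iou >= pos_thresh] = 1             # foreground
    labels[iou.argmax(dim=0)] = 1                 # best anchor per ground-truth box is also foreground
    return labels

anchors = torch.tensor([[0., 0., 100., 100.], [50., 50., 200., 200.], [300., 300., 350., 350.]])
gt_boxes = torch.tensor([[40., 40., 190., 190.]])
print(label_anchors(anchors, gt_boxes))           # tensor([0, 1, 0])
```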


  • Regression layer

For bounding box regression, they adopt the parameterizations of the 4 coordinates.
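In the paper these are defined as:

t_x = (x - x_a) / w_a,    t_y = (y - y_a) / h_a,
t_w = log(w / w_a),       t_h = log(h / h_a),
t*_x = (x* - x_a) / w_a,  t*_y = (y* - y_a) / h_a,
t*_w = log(w* / w_a),     t*_h = log(h* / h_a)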

Here, x, y, w, and h denote the box's center coordinates and its width and height. Variables x, x_a, and x* are for the predicted box, anchor box, and ground-truth box, respectively (likewise for y, w, and h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.

Bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In the above formulation, the features used for regression are of the same spatial size (3x3) on the feature maps.


3. Loss function for training RPNs

The loss function combines the losses obtained from classification and bounding-box regression, and minimizes an objective function following the multi-task loss in Fast R-CNN.
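This is the multi-task loss from the paper:

L({p_i}, {t_i}) = (1 / N_cls) Σ_i L_cls(p_i, p*_i) + λ (1 / N_reg) Σ_i p*_i L_reg(t_i, t*_i)

where: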

i : index of an anchor in a mini-batch
p_i : predicted probability that anchor i is an object
p*_i : ground-truth label, 1 means object (= positive) and 0 means background (= negative)
L_cls : log loss over the 2 classes (object or background)
N_cls : mini-batch size, default 256
t_i : the 4 parameterized coordinates of the predicted bounding box
t*_i : those of the ground-truth box associated with a positive anchor
L_reg : regression loss, applied only when an object exists (uses the smooth L1 loss)
N_reg : the number of anchor locations, default 256x9
λ : balancing weight, default 10
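A simplified PyTorch sketch of this loss, with the inputs being random placeholders for a sampled mini-batch of anchors (the normalization constants and λ follow the defaults above):

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, box_deltas, labels, target_deltas, lam=10.0, n_reg=256 * 9):
    """cls_logits: (A, 2) objectness logits; box_deltas / target_deltas: (A, 4)
    parameterized coordinates; labels: (A,) with 1 = object, 0 = background."""
    n_cls = labels.numel()                                           # N_cls: mini-batch size
    cls_loss = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls

    pos = labels == 1                                                # p*_i = 1 activates the regression term
    reg_loss = F.smooth_l1_loss(box_deltas[pos], target_deltas[pos],
                                reduction="sum") / n_reg
    return cls_loss + lam * reg_loss

# toy mini-batch of 256 sampled anchors
logits, deltas = torch.randn(256, 2), torch.randn(256, 4)
labels, targets = torch.randint(0, 2, (256,)), torch.randn(256, 4)
print(rpn_loss(logits, deltas, labels, targets))
```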



4. Training RPNs

The process of training RPNs is as follows :

0) Prepare a CNN M0 pre-trained on ImageNet data.

1) Train RPN M1 on top of M0's convolutional feature map.

2) Extract region proposals P1 from the images using RPN M1.

3) Obtain model M2 by training Fast R-CNN, initialized from M0, on the extracted region proposals P1.

4) With the convolutional features of Fast R-CNN model M2 fixed, train the RPN to obtain RPN model M3.

5) Extract region proposals P2 from the images using RPN model M3.

6) Train Fast R-CNN model M4 using region proposals P2, with the convolutional features of RPN model M3 kept fixed.

 

Simply put, the RPN and Fast R-CNN are trained alternately while sharing convolutional features.
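The alternation can be summarized in code form; the helpers below (load_imagenet_pretrained_backbone, train_rpn, generate_proposals, train_fast_rcnn) are hypothetical placeholders used only to show the order of the steps, not a real API:

```python
# Hypothetical placeholder helpers; each stands in for a full training loop
# or inference pass and just returns a tagged object for illustration.
def load_imagenet_pretrained_backbone():
    return {"name": "M0_backbone"}

def train_rpn(backbone, freeze_backbone=False):
    return {"backbone": backbone, "head": "rpn", "frozen": freeze_backbone}

def generate_proposals(rpn, images):
    return [f"proposals_for_{img}" for img in images]

def train_fast_rcnn(backbone, proposals, freeze_backbone=False):
    return {"backbone": backbone, "head": "fast_rcnn", "frozen": freeze_backbone}

train_images = ["img_0", "img_1"]

M0 = load_imagenet_pretrained_backbone()                        # step 0
M1 = train_rpn(M0)                                              # step 1: RPN M1
P1 = generate_proposals(M1, train_images)                       # step 2: proposals P1
M2 = train_fast_rcnn(M0, P1)                                    # step 3: detector M2
M3 = train_rpn(M2["backbone"], freeze_backbone=True)            # step 4: RPN M3 on fixed features
P2 = generate_proposals(M3, train_images)                       # step 5: proposals P2
M4 = train_fast_rcnn(M3["backbone"], P2, freeze_backbone=True)  # step 6: final detector M4
```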


5. Summary

R-CNN showed the possibility of high-performance object detection by combining region proposals with a classification CNN. Fast R-CNN proposed RoI pooling to obtain a fixed-length vector from the convolutional feature map, compensating for R-CNN's speed disadvantage. To remove the bottleneck caused by the region proposal algorithm, Faster R-CNN proposed a Region Proposal Network that generates region proposals directly from the convolutional feature map.


Reference : <Link><Link><Link>