There are two types of object detection models: one-stage and two-stage. A one-stage model detects objects without a preliminary step. A two-stage detector, by contrast, first proposes regions of interest and then classifies each region to decide whether it contains an object. The advantage of a one-stage detector is speed: it makes predictions quickly enough for real-time use.
YoloV4 is an important improvement over YoloV3. A new architecture in the backbone and modifications to the neck improved the mAP (mean Average Precision) by 10% and the FPS (Frames per Second) by 12%. In addition, it has become easier to train this neural network on a single GPU.
We are going to detail all the layers of YoloV4 to understand how it works.
What’s the backbone for? It’s a deep neural network composed mainly of convolution layers. Its main objective is to extract the essential features. Selecting the backbone is a key step because it directly affects the performance of object detection. The backbone is often initialized from a pre-trained neural network.
The YoloV4 backbone architecture is composed of three parts:
- Bag of freebies
- Bag of specials
We are going to explain all these concepts in the following parts.
Bag of freebies
Bag of freebies methods are the set of methods that only increase the cost of training or change the training strategy while leaving the cost of inference low. Let’s present some simple methods commonly used in computer vision.
The main objective of data augmentation methods is to increase the variability of the training images so that the trained model generalizes better.
The most commonly used methods are Photometric Distortion, Geometric Distortion, MixUp, CutMix and GANs.
Photometric distortion creates new images by adjusting brightness, hue, contrast, saturation and noise to display more varieties of the same image.
In the example above we adjusted the Hue(or color appearance parameter) to modify the image and create new samples to create more variability in our training set.
Geometric distortion methods cover all the techniques that rotate, flip, randomly scale or crop the image.
In the first example (i.e. figure 4), we rotated the original image by 90°. In the second example (i.e. figure 5), we performed an affine transformation of the original image.
Mixup augmentation forms a new image through weighted linear interpolation of two existing images: we take two images and compute a linear combination of their tensors. Mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.
In mixup, two images are mixed with weights λ and 1−λ. λ is drawn from a symmetric Beta distribution with parameter alpha. This creates new virtual training samples.
In image classification, images and labels can be mixed as follows:
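The mixing rule is usually written as x̃ = λ·xi + (1−λ)·xj for the images and ỹ = λ·yi + (1−λ)·yj for the one-hot labels. Below is a minimal NumPy sketch of that rule; the function name, the default alpha of 0.2 and the toy 4×4 images are illustrative assumptions, not values taken from YoloV4.

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Blend two samples and their one-hot labels with the same weight."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # symmetric Beta(alpha, alpha)
    x = lam * x_i + (1.0 - lam) * x_j     # mixed image
    y = lam * y_i + (1.0 - lam) * y_j     # mixed (soft) label
    return x, y, lam

# Toy usage: an all-black and an all-white image with two classes.
rng = np.random.default_rng(0)
img_a, img_b = np.zeros((4, 4, 3)), np.ones((4, 4, 3))
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_img, mixed_lab, lam = mixup(img_a, lab_a, img_b, lab_b, rng=rng)
```

The mixed label still sums to 1, so it can be used directly with a cross-entropy loss that accepts soft targets.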
CutMix augmentation strategy: patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches. CutMix improves the model robustness against input corruptions and its out-of-distribution detection performances.
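As a rough illustration of the cut-and-paste step, here is a minimal NumPy sketch. It assumes images of identical shape and a patch given explicitly as (top, left, height, width); in practice the patch is sampled at random, and the function name is hypothetical.

```python
import numpy as np

def cutmix(x_a, y_a, x_b, y_b, box):
    """Paste the patch `box` of x_b into x_a and mix labels by patch area."""
    top, left, h, w = box
    out = x_a.copy()
    out[top:top + h, left:left + w] = x_b[top:top + h, left:left + w]
    # Fraction of the image that still comes from x_a.
    lam = 1.0 - (h * w) / (x_a.shape[0] * x_a.shape[1])
    y = lam * y_a + (1.0 - lam) * y_b
    return out, y

# Toy usage: paste a 2x2 white patch into a 4x4 black image.
img, label = cutmix(np.zeros((4, 4)), np.array([1.0, 0.0]),
                    np.ones((4, 4)), np.array([0.0, 1.0]),
                    box=(0, 0, 2, 2))
```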
Bag of freebies to fight against bias
The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000).
Usually, in classification problems, cross entropy is used as the loss function to train the model. The advantage of this function is that it penalizes an error more strongly when the model is confident in the wrong answer.
Nevertheless, this function still assigns a small loss to examples that are already well classified, and summed over the many easy examples these small loss values can overwhelm the rare class.
Focal loss builds on cross entropy by introducing a (1 − pt)^γ modulating factor. This factor shifts the focus of training toward correcting misclassified examples.
The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, FL is equivalent to CE, and as γ is increased the effect of the modulating factor is likewise increased.
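To make the effect of the modulating factor concrete, here is a small plain-Python sketch of FL(pt) = −(1 − pt)^γ · log(pt), where pt is the probability the model assigns to the true class. The optional class-balancing weight used in some focal loss variants is left out, and the default γ = 2 is only an illustrative choice.

```python
import math

def cross_entropy(p_t):
    """Cross entropy for the probability assigned to the true class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Cross entropy scaled by the modulating factor (1 - p_t) ** gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1).
for p_t in (0.9, 0.5, 0.1):
    print(p_t, cross_entropy(p_t), focal_loss(p_t))
```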
Whenever you feel absolutely right, you may be plainly wrong. A 100% confidence in a prediction may reveal that the model is memorizing the data instead of learning. Label smoothing lowers the target upper bound of the prediction to a smaller value, say 0.9, and uses this value instead of 1.0 when calculating the loss. This mitigates overfitting.
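A minimal sketch of one common way to smooth a one-hot target: the true class gets 1 − ε and the remaining ε is spread evenly over the other classes. This is an assumption about the exact variant; other formulations spread ε over all classes, including the true one.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Replace 1 by (1 - epsilon) and share epsilon among the other classes."""
    k = len(one_hot)
    off_value = epsilon / (k - 1)
    return [(1.0 - epsilon) if v == 1 else off_value for v in one_hot]

# Toy usage with four classes: the true-class target becomes 0.9.
smoothed = smooth_labels([0, 1, 0, 0])
```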
Most object detection models use a bounding box to predict the location of an object. To evaluate the quality of a prediction, the L2 norm is used to measure the difference in position and size between the predicted bounding box and the real bounding box.
The disadvantage of the L2 norm is that it depends on scale: errors on small objects contribute little to the loss, so training mostly minimizes errors on large bounding boxes.
To address this problem we use IoU loss for the YoloV4 model.
Compared to the L2 loss, we can see that instead of optimizing four coordinates independently, the IoU loss considers the bounding box as a unit. Thus the IoU loss could provide more accurate bounding box prediction than the L2 loss. Moreover, the definition naturally normalizes the IoU to [0, 1] regardless of the scales of the bounding boxes.
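The following plain-Python sketch shows how an IoU and a simple IoU loss can be computed. It assumes boxes given as (x1, y1, x2, y2) corners and uses the common form loss = 1 − IoU; other formulations, such as −ln(IoU), also exist.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred, truth):
    """Zero for a perfect match, one when the boxes do not overlap."""
    return 1.0 - iou(pred, truth)
```

Scaling both boxes by the same factor leaves the value unchanged, which is the scale invariance mentioned above.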
Bag of specials
Bag of special methods are the set of methods which increase inference cost by a small amount but can significantly improve the accuracy of object detection.
Mish is a novel self-regularized non-monotonic activation function defined by f(x) = x tanh(softplus(x)).
Why does this activation function improve training?
Mish is bounded below and unbounded above, with a range of [≈ -0.31, ∞). Because it preserves a small amount of negative information, Mish eliminates by design the preconditions necessary for the Dying ReLU phenomenon: a large negative bias can push a ReLU unit into its zero region, so its weights are no longer updated during backpropagation and the neuron becomes inoperative for prediction.
These properties help expressivity and information flow. Being unbounded above, Mish avoids saturation, which generally slows training down drastically because of near-zero gradients. Being bounded below is also advantageous since it results in strong regularization effects.
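The definition f(x) = x tanh(softplus(x)) translates almost directly into code. Here is a plain-Python sketch, where softplus is written in a numerically stable form (an implementation choice, not something the definition requires):

```python
import math

def softplus(x):
    """log(1 + exp(x)), written so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

# Small negative inputs keep a small negative output instead of being zeroed.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, mish(x))
```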
The Cross Stage Partial architecture is derived from the DenseNet architecture which uses the previous input and concatenates it with the current input before moving into the dense layer.
Each stage layer of a DenseNet contains a dense block and a transition layer, and each dense block is composed of k dense layers. The output of the ith dense layer will be concatenated with the input of the ith dense layer, and the concatenated outcome will become the input of the (i + 1)th dense layer. The equations showing the above-mentioned mechanism can be expressed as:
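Using the notation explained right after this block, the dense-block mechanism can be written as the set of equations below. This is a sketch based on the surrounding description, so the exact layout is an assumption:

```latex
\begin{aligned}
x_1 &= w_1 * x_0 \\
x_2 &= w_2 * [x_0, x_1] \\
    &\;\;\vdots \\
x_k &= w_k * [x_0, x_1, \dots, x_{k-1}]
\end{aligned}
```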
where ∗ represents the convolution operator, and [x0, x1, …] means to concatenate x0, x1, …, and wi and xi are the weights and output of the ith dense layer, respectively.
CSP is based on the same principle, except that instead of concatenating the ith output with the ith input, we split the input x0 into two parts, x0′ and x0′′. One part, x0′′, passes through the dense layers; the other part, x0′, is concatenated at the end with the output of the dense layers applied to x0′′.
This translates mathematically into the following equation:
Without this split, different dense layers repeatedly learn copied gradient information; routing only part of the input through the dense layers is meant to reduce that duplication.
The essential role of the neck is to collect feature maps from different stages. Usually, a neck is composed of several bottom-up paths and several top-down paths.
We will explain the different elements that make up the neck of YoloV4 and their usefulness in the architecture.
What is the problem caused by CNNs and fully connected networks?
A fully connected network requires a fixed-size input, so we need fixed-size images, but when detecting objects we don’t necessarily have them. This forces us to rescale or crop the images, which can remove part of the object we want to detect and therefore decrease the accuracy of our model.
The second problem caused by CNNs is that the size of the sliding window is fixed.
How does SPP work?
At the output of the convolutional layers we have the feature maps, that is, the features generated by our different filters. To keep it simple, imagine a filter able to detect circular shapes: it produces a feature map that highlights these shapes while keeping their location in the image.
The Spatial Pyramid Pooling layer generates fixed-size features whatever the size of our feature maps. To do so, it uses pooling layers such as Max Pooling to generate several representations of the feature maps.
I will detail the different steps carried out by an SPP.
In the case of Figure 13, we have a 3-level SPP. Suppose conv5 (i.e. the last convolution layer) has 256 feature maps.
- First, each feature map is pooled down to a single value (grey part in figure 13). The size of the vector is then (1, 256).
- Then, each feature map is pooled to 4 values (green part in figure 13). The size of the vector is then (4, 256).
- In the same way, each feature map is pooled to 16 values (blue part in figure 13). The size of the vector is then (16, 256).
- The 3 vectors created in the previous 3 steps are then concatenated to form a fixed-size vector, which will be the input of the fully connected network.
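The steps above can be sketched with NumPy as follows. The sketch assumes feature maps of shape (channels, height, width), max pooling in every bin, and grids of 1×1, 2×2 and 4×4 bins; the function name and the bin-edge rounding are illustrative choices.

```python
import numpy as np

def spatial_pyramid_pool(feature_maps, levels=(1, 2, 4)):
    """Max-pool each (C, H, W) map on n x n grids and concatenate the results."""
    c, h, w = feature_maps.shape
    pooled = []
    for n in levels:
        # Bin edges cover the whole map even when H or W is not divisible by n.
        rows = np.linspace(0, h, n + 1).astype(int)
        cols = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_maps[:, rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))  # one value per channel
    return np.concatenate(pooled)

# Two inputs of different spatial size give vectors of the same length.
fm_small = np.random.default_rng(0).random((256, 13, 13))
fm_large = np.random.default_rng(1).random((256, 10, 17))
vec_small = spatial_pyramid_pool(fm_small)
vec_large = spatial_pyramid_pool(fm_large)
```

With 256 feature maps the output always has 256 × (1 + 4 + 16) = 5376 values, whatever the height and width of the input.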
What are the benefits of SPP?
- SPP is able to generate a fixed-length output regardless of the input size
- SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations
- SPP can pool features extracted at variable scales thanks to the flexibility of input scales
PaNet: aggregating different backbone levels
In the early days of deep learning, simple networks were used where an input passed through a succession of layers. Each layer takes input from the previous layer. The early layers extract localized texture and pattern information to build up the semantic information needed in the later layers. However, as we progress to the right, localized information that may be needed to fine-tune the prediction may be lost.
To correct this problem, PaNet has introduced an architecture that allows better propagation of layer information from bottom to top or top to bottom.
The components of the neck typically flow up and down among layers and connect only the few layers at the end of the convolutional network.
We can see in figure 14, that the information of the first layer is added in layer P5 (red arrow), and propagated in layer N5 (green arrow). This is a shortcut to propagate low level information to the top.
How does PaNet add information to the top layers?
In the original implementation of PaNet, the current layer and the information from a previous layer are added together to form a new vector. In the YoloV4 implementation, a modified version is used in which the new vector is created by concatenating the input with the vector from a previous layer.
The role of the head in the case of a one stage detector is to perform dense prediction. The dense prediction is the final prediction which is composed of a vector containing the coordinates of the predicted bounding box (center, height, width), the confidence score of the prediction and the label.
Bag of freebies (BoF)
The CIoU loss introduces two new concepts compared to IoU loss. The first concept is the concept of central point distance, which is the distance between the actual bounding box center point and the predicted bounding box center point.
The second concept is the aspect ratio: we compare the aspect ratio of the true bounding box with that of the predicted bounding box. With these 3 measures (overlap, center distance and aspect ratio) we can measure the quality of the predicted bounding box.
where b and bgt denote the central points of B and Bgt, ρ(·) is the Euclidean distance, c is the diagonal length of the smallest enclosing box covering the two boxes, α is a positive trade-off parameter, and v measures the consistency of aspect ratio.
where h is the height of the bounding box and w is the width.
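Putting the three terms together, the CIoU loss is commonly written as L = 1 − IoU + ρ²(b, bgt)/c² + αv, with v = (4/π²)(arctan(wgt/hgt) − arctan(w/h))² and α = v / ((1 − IoU) + v). The plain-Python sketch below follows that form; it assumes boxes given as (center x, center y, width, height), and the function name is illustrative.

```python
import math

def ciou_loss(pred, truth):
    """CIoU loss for two boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = pred
    tx, ty, tw, th = truth
    # Convert to corner coordinates.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    t_x1, t_y1, t_x2, t_y2 = tx - tw / 2, ty - th / 2, tx + tw / 2, ty + th / 2
    # Overlap term.
    inter_w = max(0.0, min(p_x2, t_x2) - max(p_x1, t_x1))
    inter_h = max(0.0, min(p_y2, t_y2) - max(p_y1, t_y1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + tw * th - inter)
    # Squared center distance over squared diagonal of the enclosing box.
    rho2 = (px - tx) ** 2 + (py - ty) ** 2
    c2 = (max(p_x2, t_x2) - min(p_x1, t_x1)) ** 2 + (max(p_y2, t_y2) - min(p_y1, t_y1)) ** 2
    # Aspect-ratio consistency and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v
```

Unlike a plain IoU loss, this still gives a useful signal when the boxes do not overlap, because the center-distance term keeps changing as the prediction moves.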
CmBN (Cross mini Batch Normalization)
Why use Cross mini Batch Normalization instead of Batch Normalization? What are its advantages and how does it work? We will answer these questions in this paragraph.
Batch Normalization does not perform well when the batch size becomes small. The estimates of the mean and standard deviation are biased by the sample size: the smaller the sample, the less likely it is to represent the full distribution. To solve this problem, Cross mini Batch Normalization uses estimates from recent batches to improve the quality of each batch’s estimate. A challenge of computing statistics over multiple iterations is that the network activations from different iterations are not comparable to each other due to changes in network weights. To compensate, Taylor polynomials are used; they can approximate any infinitely differentiable function in the neighbourhood of a point.
Let’s take the example of the cosine function. Let f(x) = cos(x); we look for an approximation of this function in the neighbourhood of 0. We use Taylor’s formula at order 2 with x0 = 0:
f(x) ≈ f(x0) + f’(x0)(x − x0) + (1/2)f’’(x0)(x − x0)²
= 1 − x²/2
We can see in figure 19 (green curve) that this approximation is pretty good in the neighbourhood of 0. The further we move away, however, the more the quality of the approximation decreases; we can go to higher orders to improve it.
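A quick numerical check of the order-2 approximation cos(x) ≈ 1 − x²/2 makes the same point without the figure. The sample points are arbitrary:

```python
import math

def cos_taylor2(x):
    """Order-2 Taylor approximation of cos(x) around 0."""
    return 1.0 - x ** 2 / 2.0

# The error grows as x moves away from 0.
for x in (0.1, 0.5, 1.0, 2.0):
    error = abs(math.cos(x) - cos_taylor2(x))
    print(f"x={x}: cos={math.cos(x):.5f} approx={cos_taylor2(x):.5f} error={error:.5f}")
```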
Now back to Batch Normalization, the classic way to normalize a batch is as follows:
where ε is a small constant added for numerical stability, and μt(θt) and σt(θt) are the mean and variance computed for all the examples from the current mini-batch.
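As a minimal NumPy sketch of this classic normalization, x̂ = (x − μ) / sqrt(σ² + ε), assume an input of shape (batch, features). The learnable scale and shift that batch-norm layers usually add are left out, and the value of ε is an illustrative default.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature with the statistics of the current mini-batch."""
    mu = x.mean(axis=0)    # per-feature mean over the batch
    var = x.var(axis=0)    # per-feature variance over the batch
    return (x - mu) / np.sqrt(var + eps)

# Toy usage: 32 examples with 8 features, far from zero mean and unit variance.
batch = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(32, 8))
normalized = batch_norm(batch)
```

With a very small batch, `mu` and `var` are noisy estimates, which is the weakness that the cross mini-batch variant tries to address.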
The cross mini Batch Normalization is defined as follows:
where the mean and variance are calculated from the previous N means and variances and approximated using Taylor formulae to express them as a function of the parameters θt rather than θt-N.
Neural networks work better when they generalize better. To achieve this we use regularization techniques such as dropout, which deactivates certain neurons during training. These methods generally improve accuracy during the test phase.
Nevertheless, dropout drops features randomly. This works well for fully connected layers but is not efficient for convolutional layers, where features are spatially correlated.
In DropBlock, features in a block (i.e. a contiguous region of a feature map), are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data.
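Here is a simplified NumPy sketch of the idea for a single 2D feature map. Seed positions are sampled with a probability chosen so that roughly `drop_prob` of the features end up dropped, each seed zeroes a `block_size` × `block_size` square, and the surviving features are rescaled. Treating the seed as the top-left corner of the block, along with the function name and defaults, is an assumption made for brevity.

```python
import numpy as np

def dropblock(x, block_size=3, drop_prob=0.1, rng=None):
    """Zero out contiguous square blocks of an (H, W) feature map."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = x.shape
    valid_h, valid_w = h - block_size + 1, w - block_size + 1
    # Seed probability chosen so that about drop_prob of the features are dropped.
    gamma = drop_prob * (h * w) / (block_size ** 2) / (valid_h * valid_w)
    seeds = rng.random((valid_h, valid_w)) < gamma
    mask = np.ones((h, w))
    for i, j in zip(*np.nonzero(seeds)):
        mask[i:i + block_size, j:j + block_size] = 0.0  # drop the whole block
    kept = mask.sum()
    scale = mask.size / kept if kept > 0 else 0.0       # keep the expected magnitude
    return x * mask * scale, mask

# Toy usage on an 8x8 map of ones.
dropped, mask = dropblock(np.ones((8, 8)), block_size=3, drop_prob=0.3,
                          rng=np.random.default_rng(0))
```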
Mosaic data augmentation
Mosaic data augmentation combines 4 training images into one in certain ratios. This allows for the model to learn how to identify objects at a smaller scale than normal. It also encourages the model to localize different types of images in different portions of the frame.
Self-Adversarial Training (SAT)
Self-Adversarial Training (SAT) is a new data augmentation technique that operates in 2 forward-backward stages. In the 1st stage the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object in the image. In the 2nd stage, the neural network is trained in the normal way to detect an object in this modified image, using the original label from before the image was altered.
Eliminate grid sensitivity
The equation bx = σ(tx) + cx, by = σ(ty) + cy, where cx and cy are always whole numbers, is used in YOLOv3 for evaluating the object coordinates. Therefore, extremely high absolute values of tx are required for bx to approach cx or cx + 1. We solve this problem by multiplying the sigmoid by a factor exceeding 1.0, which eliminates the effect of the grid on which the object is undetectable.
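The plain-Python sketch below illustrates the effect of scaling the sigmoid. The recentering term −(scale − 1)/2, which keeps the output centered on the cell, is an assumption about how the scaling is typically applied rather than something stated above, and the scale of 1.1 is only an example.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_x(t_x, c_x, scale=1.0):
    """Decode a box center; scale > 1 makes the cell borders reachable."""
    return scale * sigmoid(t_x) - (scale - 1.0) / 2.0 + c_x

# With scale = 1.0 the right border of cell 3 (x = 4) is only approached asymptotically.
print(decode_x(5.0, 3, scale=1.0))
# With scale = 1.1 the same moderate t_x already reaches past the border.
print(decode_x(5.0, 3, scale=1.1))
```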
Using multiple anchors for a single ground truth
We predict several boxes because it is difficult for a convolutional network to directly predict a set of boxes for objects of different aspect ratios. That’s why we use anchors, which divide the image space according to different strategies.
From the feature maps created by the convolution layers, we create many anchor boxes of different ratios in order to represent objects of any size. We then use the IoU to assign each box to an object or to the background according to the threshold below.
IoU (truth, anchor) > IoU threshold (formula)
Cosine annealing scheduler
A cosine function is used to update the learning rate. The advantage of the cosine schedule is that it is cyclic, which makes it easier to get out of local minima than with a step schedule.
The learning rate decreases until the end of the cycle, then increases abruptly, which may allow the optimizer to escape a local minimum if the function being optimized is not convex. It then starts to decrease slowly again. By choosing the number of cycles we can thus avoid local minima.
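A cyclic cosine schedule of this kind can be sketched in a few lines of plain Python. The formula used here, lr = lr_min + 0.5 · (lr_max − lr_min) · (1 + cos(π · t / T)) with t reset to 0 at the start of each cycle of length T, is the usual warm-restart form; the parameter names and default values are illustrative assumptions.

```python
import math

def cosine_annealing_lr(step, cycle_length, lr_max=0.01, lr_min=0.0):
    """Learning rate for a cosine schedule that restarts every cycle_length steps."""
    t_cur = step % cycle_length  # position inside the current cycle
    cosine = math.cos(math.pi * t_cur / cycle_length)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

# The rate decays within a cycle and jumps back to lr_max at each restart.
schedule = [cosine_annealing_lr(step, cycle_length=10) for step in range(30)]
```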
Optimal hyperparameters
To find the best hyperparameters, genetic algorithms are used. N randomly selected parameter sets are initialized. Then we train N models, select the K best, derive new random parameters from those K best models, train N2 new models, and repeat until we reach the final iteration.
Random training shapes
Many single-stage object detectors are trained with a fixed input image shape. To improve generalization, we can train the model with different image sizes. (Multi-Scale Training in YOLO)
Bag of specials (BoS)
See previous part
Attention layers are very common in deep learning, especially in language processing; they are found in the latest state-of-the-art models.
In the case of Yolo, attention is used to highlight the most important features created by the convolution layers and to suppress the unimportant ones.
SAM simply consists of applying two separate transforms to the output feature map of a convolutional layer, a Max Pooling and an Avg Pooling. The two features are concatenated and then passed through a convolutional layer, before applying a sigmoid function that highlights where the most important features are located.
For YoloV4, we use a modified version of SAM (figure 27) where the layers of Max Pooling and Avg Pooling have been removed.
NMS (Non-Maximum Suppression) is used to remove the boxes that represent the same object while keeping the one that is the most precise compared to the real bounding box.
where b and bgt denote the central points of B and Bgt, ρ(·) is the Euclidean distance, and c is the diagonal length of the smallest enclosing box covering the two boxes.
We test whether the overlap rate minus the distance between the two centers is lower than the threshold ε. If so, we keep the bounding box; otherwise we delete it.
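This test can be sketched in plain Python as a greedy NMS loop. The sketch assumes boxes given as (x1, y1, x2, y2) corners, takes the distance term to be the squared center distance ρ² divided by c², and uses a threshold of 0.5 as an example; the function names are illustrative.

```python
def diou(box_a, box_b):
    """IoU minus the normalized squared center distance of two corner boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter)
    # Squared distance between centers.
    rho2 = ((box_a[0] + box_a[2]) / 2 - (box_b[0] + box_b[2]) / 2) ** 2 \
         + ((box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])) ** 2 \
       + (max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])) ** 2
    return iou - rho2 / c2

def diou_nms(boxes, scores, threshold=0.5):
    """Return the indices of the boxes that survive suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # Keep only boxes whose DIoU with the selected box is below the threshold.
        order = [i for i in order if diou(boxes[best], boxes[i]) < threshold]
    return keep

# Toy usage: the second box duplicates the first, the third is a separate object.
kept = diou_nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
                [0.9, 0.8, 0.7])
```

Because the center distance is subtracted, two boxes that overlap heavily but have distant centers are less likely to be merged than with a plain IoU test.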
As you can see there are many layers to build an object detection model. For YoloV4, the researchers decided to make the best compromise between the mAP and the training and inference speed of the model allowing its use in embedded devices. Nevertheless, with the rise in power of mobile chips that are becoming more and more economical with integrated GPUs specially designed for deep learning processing, it is possible to envisage other architectures for the future.
In the next post, I will talk about the detection of traffic signs with YoloV4, an application used in self-driving cars.