Why discuss anchors? If you try to implement YOLO from scratch, build a training pipeline for your custom dataset, or do data augmentation for object detection (especially with YOLO), it all seems difficult if you don't understand the idea behind anchors.
After I went through the nice explanation of implementing the inference part of YOLOv3 in this nice blog post by Ayoosh, I decided to see if I could implement the training pipeline myself, but I found it very difficult to even understand how everything is done. Then I found out that it all rests on a full understanding of anchors.
Well, I won't tell you that I later implemented everything from scratch myself; instead, I found a PyTorch implementation from Ultralytics, and from there I was able to grasp the full details.
So here is my take on how I think it works.
What is YOLO, in basic terms, without too much jargon? Well, if you are familiar with convolutional neural networks, skip connections and upsampling (oh!! I said no jargon, pardon me), then this should be easy to understand.
If you remember vividly from training images for image classification, you will have heard how different layers of the convolutional network each have their own representation of the image; we call this a feature map.
Each layer, based on its depth, captures a different representation of the image. The early layers capture low-level, concrete details of the image, while the deeper layers capture more abstract, semantic descriptions. But something happens as we move deeper down the network, from the first layer to the last: some spatial features (information) are lost.
And since YOLO needs the spatial features, which account for the position of an object in the image, it makes heavy use of skip connections (residual connections); it also avoids pooling (since pooling would destroy the spatial features) and instead uses strided convolutions to do the downsampling.
YOLO reduces the image size from the first layer to the last into a square grid, based on the stride. YOLO has three detection layers.
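As a quick illustration (assuming the common 416 X 416 input and YOLOv3's strides of 32, 16 and 8), the grid size at each detection layer is just the input size divided by the stride:

```python
# Grid size at each YOLOv3 detection layer = input size / stride
input_size = 416
strides = [32, 16, 8]

grid_sizes = [input_size // s for s in strides]
print(grid_sizes)   # [13, 26, 52]
```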
So what happens at each of these detection layers? Remember I said conv layers output feature maps, which also contain spatial features; for the last detection layer, the image is reduced from 416 X 416 to 13 X 13.
To run detection across this feature map, YOLO needs to find out what each cell in the 13 X 13 grid contains. So how does it get to know what each cell contains?
Each cell is assigned 3 anchors, each containing a set of properties (x, y, w, h, objectness score, classes). Using these anchors we can check whether they contain an object via their objectness score, and then check which class they contain.
Let's say your model has 2 classes; then each anchor contains (x, y, w, h, objectness score, class(2)), hence it has 7 properties. Remember each cell contains three anchors, so that is 3 anchors X 7 = 21 elements per cell; the formula for this is B X (5 + C). Then, for YOLO to take the anchors into consideration, it modifies the channel dimension of the feature map (by channel, I mean what RGB is in a normal image).
Since in PyTorch the conv output is always B X C X H X W, using the above information the output of the detection layer will be (32 X 21 X 13 X 13) if we are using a batch size of 32, where the channel size accounts for the B X (5 + C) anchor elements. To make it accessible and easier to run our predictions on, we reshape the output into (batch_size X grid_size * grid_size * num_anchors X (5 + num_classes)), i.e. (32 X (13*13*3) X 7) = (32 X 507 X 7). We do the same thing for the other detection layers and then append their outputs.
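Here is a minimal sketch of that reshape in PyTorch, assuming a batch of 32, 3 anchors, 2 classes and a 13 X 13 grid (the variable names are my own):

```python
import torch

batch_size, num_anchors, num_classes, grid = 32, 3, 2, 13
bbox_attrs = 5 + num_classes                     # x, y, w, h, objectness + classes

# raw detection-layer output: B x C x H x W, with C = num_anchors * (5 + C)
raw = torch.randn(batch_size, num_anchors * bbox_attrs, grid, grid)   # (32, 21, 13, 13)

# move the anchor attributes to the last dimension and flatten the grid
pred = raw.view(batch_size, num_anchors, bbox_attrs, grid, grid)
pred = pred.permute(0, 3, 4, 1, 2).contiguous()                       # (32, 13, 13, 3, 7)
pred = pred.view(batch_size, grid * grid * num_anchors, bbox_attrs)   # (32, 507, 7)
print(pred.shape)   # torch.Size([32, 507, 7])
```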
Instead of outputting the bounding box coordinates directly, YOLO outputs offsets to the three anchors present in each cell. So the prediction is run on the reshaped output of the detection layer (32 X 169 X 3 X 7), and since we have the other detection layers' feature maps of (52 X 52) and (26 X 26), summing them all together gives ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10647, hence the final output is (32 X 10647 X 7).
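A quick sanity check on that count (purely illustrative):

```python
# total number of anchor predictions across the three detection layers
grids = [52, 26, 13]
num_anchors = 3
total = sum(g * g for g in grids) * num_anchors
print(total)   # 10647
```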
The prediction then goes like this:
bx = sigmoid(tx) + cx
by = sigmoid(ty) + cy
bw = pw * e^tw
bh = ph * e^th
where tx, ty, tw, th are what the model outputs (the first four values in the (32 X 10647 X 7) output, i.e. output[:, :, :4]), cx and cy are the top-left co-ordinates of the grid cell, and pw and ph are the anchor dimensions for the box.
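And a hedged sketch of that decoding step in PyTorch for the 13 X 13 layer (the names, and the anchor values which are roughly the large YOLOv3 anchors divided by the stride of 32, are just for illustration):

```python
import torch

grid, num_anchors = 13, 3
# pred: (batch, grid*grid*num_anchors, 7) raw outputs tx, ty, tw, th, obj, classes
pred = torch.randn(32, grid * grid * num_anchors, 7)

# cx, cy: top-left corner of each grid cell, repeated once per anchor
ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
cx = xs.reshape(-1, 1).repeat(1, num_anchors).reshape(-1).float()   # (507,)
cy = ys.reshape(-1, 1).repeat(1, num_anchors).reshape(-1).float()   # (507,)

# pw, ph: anchor sizes already divided by the stride (illustrative values)
anchors = torch.tensor([[3.6, 2.8], [4.9, 6.2], [11.7, 10.2]])
pw = anchors[:, 0].repeat(grid * grid)                               # (507,)
ph = anchors[:, 1].repeat(grid * grid)

bx = torch.sigmoid(pred[:, :, 0]) + cx      # centre x, in grid units
by = torch.sigmoid(pred[:, :, 1]) + cy      # centre y, in grid units
bw = pw * torch.exp(pred[:, :, 2])          # width,  in grid units
bh = ph * torch.exp(pred[:, :, 3])          # height, in grid units
```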
In short: think of YOLO as if you are training your ResNet for image classification; after training, you take out the feature layers and use the output of the last layer as the feature map for your object detection (you just have to avoid the pooling layers). Depending on the input size and the strides used, let's say the feature-map output looks like (40 X 64 X 26 X 26). You can then reshape it so that the channel dimension carries your anchors, and you have designed your own kind of object detector.
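A minimal sketch of that idea, using a truncated torchvision ResNet-18 as a stand-in backbone with a hypothetical 1x1 conv head (this is my illustration of the principle, not how YOLOv3 is actually built; note an off-the-shelf ResNet still contains a max-pool, unlike YOLO's backbone):

```python
import torch
import torch.nn as nn
from torchvision import models

num_anchors, num_classes = 3, 2

# backbone: ResNet-18 without the average-pool and fc layers, so spatial dims survive
resnet = models.resnet18(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])

# detection head: a 1x1 conv that turns the channels into anchors * (5 + C)
head = nn.Conv2d(512, num_anchors * (5 + num_classes), kernel_size=1)

x = torch.randn(2, 3, 416, 416)
feat = backbone(x)          # (2, 512, 13, 13)  -- total stride of 32
out = head(feat)            # (2, 21, 13, 13)
out = out.view(2, num_anchors, 5 + num_classes, 13, 13)
print(out.shape)            # torch.Size([2, 3, 7, 13, 13])
```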
Understanding that is not enough for training the model. There are a lot of libraries for training YOLOv3; for example, I make use of the PyTorch implementation from Ultralytics. To train the model itself, your dataset can contain images of different sizes, and YOLO leaves you the decision of using k-means to generate your own anchors.
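If you go the k-means route, a naive sketch looks like this (plain Euclidean k-means on box widths and heights with made-up data; the YOLO papers actually cluster with an IoU-based distance, so treat this only as the general idea):

```python
import numpy as np
from sklearn.cluster import KMeans

# wh: (N, 2) array of ground-truth box widths and heights, scaled to the 416 X 416 input
wh = np.random.rand(1000, 2) * 416              # stand-in data, just for illustration

kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(wh)
centers = kmeans.cluster_centers_
anchors = centers[np.argsort(centers.prod(axis=1))]   # sort anchors from smallest to largest
print(anchors.round())                                 # 9 (w, h) pairs to use as anchors
```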
If you decide to make use of the default anchors instead, you have to fit your images into 416 X 416. And if your images are fitted into 416 X 416, the ground-truth labels have to change too.
The first image is the original image; we then fit it into a 416 X 416 box to obtain the second image. Since the image dimension is now 416 X 416, we need to convert the bounding box (the ground truth) from the former size to the new 416 X 416 size. Here x1, y1, x2, y2 are the corners of the box in the resized, padded image, and orig_x, orig_y, orig_w, orig_h are the normalised centre and size in the original image:
x1 = aspect_ratio * orig_w * (orig_x - orig_w / 2) + pad_w
y1 = aspect_ratio * orig_h * (orig_y - orig_h / 2) + pad_h
x2 = aspect_ratio * orig_w * (orig_x + orig_w / 2) + pad_w
y2 = aspect_ratio * orig_h * (orig_y + orig_h / 2) + pad_h
The aspect ratio and padding are generated by the code that transforms the original image into the second image above. Note that in the image above we have some blank spots; that's the padded part, so we need to account for it when transforming the ground truth, and the aspect ratio we used to prevent distortion of the image is also accounted for.
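A rough sketch of that label transform (my own simplified letterbox, assuming labels come as normalised (class, x, y, w, h); the actual Ultralytics code differs in details):

```python
def letterbox_labels(labels, orig_w, orig_h, new_size=416):
    """labels: list of (cls, x, y, w, h) with x, y, w, h normalised to [0, 1]."""
    ratio = min(new_size / orig_w, new_size / orig_h)    # keep aspect ratio
    pad_w = (new_size - orig_w * ratio) / 2              # horizontal padding
    pad_h = (new_size - orig_h * ratio) / 2              # vertical padding

    out = []
    for cls, x, y, w, h in labels:
        # corners of the box in the resized, padded 416 x 416 image
        x1 = ratio * orig_w * (x - w / 2) + pad_w
        y1 = ratio * orig_h * (y - h / 2) + pad_h
        x2 = ratio * orig_w * (x + w / 2) + pad_w
        y2 = ratio * orig_h * (y + h / 2) + pad_h
        out.append((cls, x1, y1, x2, y2))
    return out

# e.g. a 640 x 480 image with one box in the middle
print(letterbox_labels([(0, 0.5, 0.5, 0.2, 0.3)], orig_w=640, orig_h=480))
```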
Now, how do we match the predictions to the ground truth?
One of the trickiest parts I found confusing and hard to get is how the images are used to match the predictions from the model. Remember the ground-truth bounding boxes are in the 416 X 416 dimension, while the model prediction we want to use is in 13 X 13. How do we compare a bounding box from 416 X 416 to one represented in 13 X 13?
The way I came to think of it is this: for us to match both, since 13 X 13 is thought of as a grid, all we need to do is divide the 416 X 416 image into a grid of 13 X 13.
Doing it like this makes our life easy. Remember that for each detection layer, like 13 X 13, there are three anchors of different sizes assigned to each cell in the 13 X 13 grid, and the same goes for 52 X 52 and 26 X 26 with the anchors assigned to them.
For us to do this, we create an array of (13*13*3 X 5): we have a grid of 13 X 13 with 3 anchors per cell, and the 5 here is class, x, y, w, h. Then we create an offset for the 13 X 13 grid using meshgrid(13, 13); this provides us with x, y offsets which we fill the array's x, y columns with. Then for the width and height we fill in the anchors associated with that detection layer (I mean the grid, like 13 X 13). Note that each of the anchors has been divided by the stride associated with the detection layer.
mask = 6,7,8
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
we choose the indices 6, 7, 8 as the anchors for the 13 X 13 grid (the largest anchors go with the coarsest grid, which detects the largest objects), which are
anchor = [[116,90], [156,198], [373,326]]
We then find the stride and divide the anchors by this stride. Remember that these anchors were generated for the image dimension of 416 X 416, but now we want to use them for 13 X 13, hence we find the stride (i.e. how we move across the image; the stride guides our movement and the distance covered across the image). So we need a stride that tells us how we move across 416 X 416 to generate a 13 X 13 grid.
stride = 416 / 13
anchors = anchor / stride
Now we've generated an anchor representation for 13 X 13, and we fill the array with these anchors.
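Putting the last few steps together, here is a sketch of building that (507 X 5) anchor array (column 0 is reserved for the class, columns 1 to 4 hold x, y, w, h; the names are my own):

```python
import numpy as np

grid, num_anchors, stride = 13, 3, 416 / 13                        # stride = 32
anchor = np.array([[116, 90], [156, 198], [373, 326]]) / stride    # anchors in grid units

anchor_grid = np.zeros((grid * grid * num_anchors, 5))             # (507, 5)

# x, y offsets of each cell, repeated once per anchor
ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
offsets = np.stack([xs.ravel(), ys.ravel()], axis=1)               # (169, 2)
anchor_grid[:, 1:3] = np.repeat(offsets, num_anchors, axis=0)      # columns x, y

# w, h of the 3 anchors, tiled over all 169 cells
anchor_grid[:, 3:5] = np.tile(anchor, (grid * grid, 1))            # columns w, h

print(anchor_grid.shape)   # (507, 5)
```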
Now we come to the ground-truth labels, which are in the form of
class, x, y, w, h
0 0.13 0.191 0.252 0.33
0 0.403 0.201 0.237 0.319
Now the values from x to h are between 0 and 1. To represent them on the 13 X 13 grid, we scale them up to that dimension, which can be done by multiplying them by 13:
(0.13, 0.191, 0.252, 0.33) * 13
(0.403, 0.201, 0.237, 0.319) * 13
Once this is all done, we use the Jaccard index, also called Intersection over Union (IoU).
We are not yet sure of the location or position of the box on the 13 X 13 grid, so what we can do is compute the intersection over union between each anchor and the ground-truth bounding box.
After the whole IoU process, we get an IoU score relating each bounding box to every element of the 13 X 13 grid. The element in the 13 X 13 grid that has the highest IoU is assigned that bounding box.
(13 * 13 * 3 X 5) = (507 X 5)
iou([:, 1:], (0.13, 0.191, 0.252, 0.33) * 13)   # output
Those are the IoU outputs for each of the anchors; the anchor with the highest IoU, which is 0.5, is replaced with the value
(0.13, 0.191, 0.252, 0.33) * 13.
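To make the matching concrete, here is a simplified sketch (mine, not the Ultralytics code) that scales one ground-truth box to the 13 X 13 grid, computes its IoU against every row of the anchor array built above, and writes the box into the best-matching row:

```python
import numpy as np

def iou_xywh(boxes, box):
    """IoU between an array of (x, y, w, h) boxes and a single (x, y, w, h) box."""
    x1 = np.maximum(boxes[:, 0] - boxes[:, 2] / 2, box[0] - box[2] / 2)
    y1 = np.maximum(boxes[:, 1] - boxes[:, 3] / 2, box[1] - box[3] / 2)
    x2 = np.minimum(boxes[:, 0] + boxes[:, 2] / 2, box[0] + box[2] / 2)
    y2 = np.minimum(boxes[:, 1] + boxes[:, 3] / 2, box[1] + box[3] / 2)
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = boxes[:, 2] * boxes[:, 3] + box[2] * box[3] - inter
    return inter / union

# one ground-truth label (class, x, y, w, h), normalised to [0, 1]
label = np.array([0, 0.13, 0.191, 0.252, 0.33])
gt_box = label[1:] * 13                           # scale x, y, w, h to the 13 x 13 grid

scores = iou_xywh(anchor_grid[:, 1:], gt_box)     # anchor_grid is the (507, 5) array above
best = scores.argmax()                            # index of the best-matching anchor
anchor_grid[best, 0] = label[0]                   # write the class
anchor_grid[best, 1:] = gt_box                    # replace x, y, w, h with the ground truth
```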
Hence, that's how we represent the ground-truth boxes on the grid of the detection layer.
For more understanding of anchors, see the references below, especially the first one.