VGGNet was invented by the Visual Geometry Group (VGG) at the University of Oxford. This architecture was the first runner-up of ILSVRC 2014 in the classification task, while the winner was GoogLeNet. The reason to understand VGGNet is that many modern image classification models are built on top of this architecture, much like word2vec in the NLP field.
This story discusses Very Deep Convolutional Networks for Large-Scale Image Recognition (Simonyan et al., 2014) and covers the data setup, the network architecture, the experiments and the takeaways.
An image can be any size, so Simonyan et al. standardize the input to a fixed-size 224×224 RGB image. Training images are first rescaled and then cropped randomly. Subtracting the mean RGB value, computed on the training set, from each pixel is the only preprocessing step. To increase the data volume for model training, data augmentation is applied: the cropped images are randomly flipped horizontally and their RGB values are randomly shifted.
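The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the `shift_std` parameter and the per-channel Gaussian offset are assumptions made here (the paper follows the PCA-based colour shift of Krizhevsky et al., which is replaced by a simpler shift in this sketch).

```python
import numpy as np

CROP = 224  # fixed input size used in the paper


def random_crop(image, rng, size=CROP):
    """Cut a random size x size patch from an H x W x 3 image (H, W >= size)."""
    h, w, _ = image.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size, :]


def augment(image, mean_rgb, rng, shift_std=0.0):
    """Random crop, random horizontal flip, random RGB shift, mean subtraction."""
    crop = random_crop(image, rng).astype(np.float32)
    if rng.random() < 0.5:
        crop = crop[:, ::-1, :]  # horizontal flip
    # Simplified colour shift (assumption): one random offset per channel.
    crop = crop + rng.normal(0.0, shift_std, size=3)
    # Mean subtraction is the only preprocessing step described in the paper.
    return (crop - mean_rgb).astype(np.float32)


rng = np.random.default_rng(0)
rescaled = rng.integers(0, 256, size=(256, 341, 3))   # stand-in for a rescaled training image
mean_rgb = rescaled.reshape(-1, 3).mean(axis=0)       # in practice: mean over the training set
sample = augment(rescaled, mean_rgb, rng, shift_std=5.0)
print(sample.shape)
```

Computing `mean_rgb` from a single image is only for the demo; the mean should come from the whole training set.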
To reduce the number of parameters, the authors propose using small receptive fields in place of a large one. They conclude:
- Incorporating multiple non-linear rectification layers instead of a single one makes the decision function more discriminative.
- It decreases the number of parameters while keeping performance. For example, a stack of two 3×3 layers has the same effective receptive field as one 5×5 layer but uses fewer parameters: the count drops by 28% ((25 − 18)/25). Counting weights per input-output channel pair, without biases:
Number of Parameters of 2 Layers of 3×3 Filter: 2 × 3 × 3 = 18
Number of Parameters of 1 Layer of 5×5 Filter: 1 × 5 × 5 = 25
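The arithmetic above can be checked with a short script. It is a sketch under simple assumptions (stride-1 convolutions, the same number of input and output channels in every layer, biases ignored); the helper names are made up for this example.

```python
def stacked_receptive_field(kernel, layers):
    """Effective receptive field of `layers` stacked stride-1 convs with a square kernel."""
    return 1 + layers * (kernel - 1)


def conv_weights(kernel, layers, channels=1):
    """Weights (no biases) in `layers` stacked convs with `channels` in and out channels."""
    return layers * kernel * kernel * channels * channels


small = conv_weights(kernel=3, layers=2)   # two 3x3 layers
large = conv_weights(kernel=5, layers=1)   # one 5x5 layer
saving = (large - small) / large

print(stacked_receptive_field(3, 2), small, large, saving)
```

The same helpers reproduce the comparison made in the paper for three 3×3 layers versus one 7×7 layer: both see a 7×7 region, with 27 versus 49 weights per channel pair.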
If you want to further understand how stacking small layers performs better than a single large layer, you may check out this story.
Simonyan et al. evaluated 6 different ConvNet configurations to see the effect of stacking layers. The difference is the number of stacked layers within the same block. For example, VGG-11 (i.e. Config A) uses 2 Conv3-256 layers while VGG-19 (i.e. Config E) uses 4 Conv3-256 layers in the third block.
To handle different scenarios, the authors set up 3 experiments: single-scale, multi-scale and multi-crop evaluation.
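To make the difference between configurations concrete, here is a small sketch that writes Config A and Config E as lists (a number is a 3×3 convolution with that many output channels, "M" is a 2×2 max-pooling layer) and counts their layers and parameters. The list encoding and the helper names are choices made for this example; the layer layout follows the configuration table in the paper, with 224×224 input and three fully connected layers (4096, 4096, 1000).

```python
# Number = 3x3 conv with that many output channels, "M" = 2x2 max pooling.
VGG_A = [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"]
VGG_E = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
         512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]


def weight_layers(cfg, fc_layers=3):
    """Conv layers plus fully connected layers (pooling has no weights)."""
    return sum(1 for v in cfg if v != "M") + fc_layers


def count_params(cfg, in_channels=3, image_size=224, fc=(4096, 4096, 1000)):
    """Total weights and biases for a config, assuming padded 3x3 convs."""
    total, channels, size = 0, in_channels, image_size
    for v in cfg:
        if v == "M":
            size //= 2                          # pooling halves the spatial size
        else:
            total += 3 * 3 * channels * v + v   # conv weights + biases
            channels = v
    features = channels * size * size           # flattened input to the first FC layer
    for units in fc:
        total += features * units + units
        features = units
    return total


for name, cfg in (("VGG-11 (A)", VGG_A), ("VGG-19 (E)", VGG_E)):
    print(name, weight_layers(cfg), "weight layers,", count_params(cfg), "parameters")
```

The totals come out at roughly 133 million parameters for Config A and 144 million for Config E: most of the parameters sit in the fully connected layers, so adding convolutional layers is comparatively cheap.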
Single Scale Evaluation
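Intuitively, more layers are better, and the authors found that the error rate generally decreases as depth grows from 11 to 19 layers. The gain flattens at the deep end, however: when trained at a fixed scale, VGG-16 is on par with or slightly better than VGG-19. When trained with scale jittering, VGG-19 (Config E) got the lowest error rate, while the number of parameters increased by only 3.6%.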
Besides using a single-scale image for evaluation, the authors introduce multi-scale evaluation. The model is run on several rescaled versions of the same test image and the resulting class posteriors are averaged.
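A rough sketch of multi-scale evaluation is shown below. Here `predict_fn` is a hypothetical callable that maps an image of any size to a vector of class probabilities, and the nearest-neighbour `resize_nearest` helper is only a stand-in for a proper image resize; neither comes from the paper. The default scales (256, 384, 512) are the ones the paper uses for models trained with scale jittering.

```python
import numpy as np


def resize_nearest(image, size):
    """Rescale so the smaller side equals `size` (nearest neighbour, for illustration)."""
    h, w = image.shape[:2]
    scale = size / min(h, w)
    new_h, new_w = max(size, round(h * scale)), max(size, round(w * scale))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return image[rows][:, cols]


def multi_scale_predict(predict_fn, image, scales=(256, 384, 512)):
    """Average the class probabilities predicted at several test scales."""
    probs = [predict_fn(resize_nearest(image, q)) for q in scales]
    return np.mean(probs, axis=0)


def toy_predict(image):
    """Toy stand-in for a trained network: its answer depends only on image size."""
    small = min(image.shape[:2]) < 300
    return np.array([1.0, 0.0]) if small else np.array([0.0, 1.0])


test_image = np.zeros((100, 200, 3), dtype=np.uint8)
print(multi_scale_predict(toy_predict, test_image))
```

In a real setup `predict_fn` would wrap a trained network applied to the whole rescaled image, with its output averaged into a single probability vector.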
Meanwhile, the authors tried dense and multi-crop methods to evaluate the model.
- It was a great breakthrough in image classification in 2014, the first year that deep learning models achieved a top-5 error rate under 10%.
- You may find this architecture's skeleton in many other modern image classification models.
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. You can reach me via Medium Blog, LinkedIn or Github.