Finally, you compute the sum of all the elements in Z to get a scalar number, i.e. 3+4+0+6+0+0+0+45+2 = 60.
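As a quick check, the same element-wise product and sum can be reproduced in NumPy. The 3*3 arrangement of the nine values below is assumed from the example above; only their sum matters here.

```python
import numpy as np

# Z: the element-wise product of the 3x3 image patch and the 3x3 filter,
# using the nine values from the example above (their 3x3 arrangement is assumed).
Z = np.array([[3, 4, 0],
              [6, 0, 0],
              [0, 45, 2]])

print(Z.sum())  # 60 -- one scalar entry of the convolution output
```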
The convolution (Conv) operation, using an appropriate filter, detects certain features in images, such as horizontal or vertical edges. For example, in the image given below, the convolution output obtained with the first filter is nonzero only in the middle two columns, while the two extreme columns (1 and 4) are zero. This is an example of vertical edge detection.
Similarly, the same filter with the 1s placed horizontally and the 0s in the middle row can be used for horizontal edge detection.
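Here is a minimal NumPy/SciPy sketch of this kind of edge detection. The 6*6 toy image (bright left half, dark right half) and the classic vertical-edge filter with columns of 1s, 0s and -1s are assumptions standing in for the figure referenced above, not the exact values from the original image.

```python
import numpy as np
from scipy.signal import correlate2d

# 6x6 toy image: bright left half, dark right half (a vertical edge in the middle).
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

# Classic 3x3 vertical-edge filter: 1s on the left, 0s in the middle, -1s on the right.
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])

# correlate2d with mode='valid' is the sliding-window operation CNNs call "convolution".
output = correlate2d(image, vertical_filter, mode='valid')
print(output)
# Only the middle two columns are nonzero -- the vertical edge is detected:
# [[ 0 30 30  0]
#  [ 0 30 30  0]
#  [ 0 30 30  0]
#  [ 0 30 30  0]]

# The transposed filter (1s in the top row, 0s in the middle row, -1s at the bottom)
# detects horizontal edges instead.
horizontal_filter = vertical_filter.T
```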
During convolution, the image (224*224*3) is convolved with a 3*3 filter and a stride of 1 (with padding, so that the spatial size is preserved) to produce a 224*224 array, as shown below.
The output (224*224) is passed through the ReLU activation function, which introduces non-linearity, and produces the feature maps (224*224) of the image.
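A minimal Keras sketch of this convolution + ReLU step is shown below. The filter count of 32 is an illustrative assumption; the point is that 'same' padding with a 3*3 kernel and stride 1 keeps the 224*224 spatial size.

```python
from tensorflow.keras import layers, models

# A 224x224x3 image convolved with 3x3 filters, stride 1 and 'same' padding,
# followed by ReLU, keeps the 224x224 spatial size (one feature map per filter).
conv = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(filters=32, kernel_size=(3, 3), strides=1,
                  padding='same', activation='relu'),
])
conv.summary()  # output shape: (None, 224, 224, 32)
```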
2. Pooling + ReLU
The pooling layer looks at larger regions (containing multiple patches) of the image and captures an aggregate statistic (max, average, etc.) of each region, making the network invariant to local transformations.
The two most popular aggregate functions used in pooling are ‘max’ and ‘average’.
- Max pooling: If any one of the patches says something strongly about the presence of a certain feature, then the pooling layer counts that feature as ‘detected’.
- Average pooling: If one patch says something very firmly but the others disagree, the pooling layer takes the average of the responses over the region.
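A minimal sketch of both pooling variants in Keras, assuming a 2*2 pooling window and the 224*224*32 feature maps from the convolution sketch above:

```python
import numpy as np
from tensorflow.keras import layers

# A toy batch of feature maps, e.g. the 224x224x32 ReLU output from above.
feature_maps = np.random.rand(1, 224, 224, 32).astype('float32')

# Max pooling keeps the strongest response in each 2x2 patch;
# average pooling keeps the mean response of the patch instead.
max_pooled = layers.MaxPooling2D(pool_size=(2, 2))(feature_maps)
avg_pooled = layers.AveragePooling2D(pool_size=(2, 2))(feature_maps)

print(max_pooled.shape, avg_pooled.shape)  # (1, 112, 112, 32) for both
```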
3. Fully Connected(FC) layer
The output of the pooling layer is flattened out into a large vector, which is fed to the fully connected layers. The final layer uses a softmax activation function, which outputs a probability between 0 and 1 for each of the classification labels the model is trying to predict.
Summing up the above points, the final convolutional neural network looks like this:
For more details on the above, please refer here.
There are various techniques used for training a CNN model to improve accuracy and avoid overfitting.
1. Regularization.
For better generalizability of the model, a very common regularization technique is to add a regularization term to the objective function. This term ensures that the model does not capture the 'noise' in the dataset, i.e. does not overfit the training data.
Objective function = Loss Function (Error term) + Regularization term
Hence the objective function can be written as:
Objective function = L(F(xi),θ) + λf(θ)
where L(F(xi),θ) is the loss function expressed in terms of the model output F(xi) and the model parameters θ. The second term λf(θ) has two components — the regularization parameter λ and the parameter norm f(θ).
There are broadly two types of regularization techniques (very similar to those used in linear regression) followed in CNNs; a Keras sketch follows the list below:
- L1 norm: f(θ) = ||θ||1 is the sum of the absolute values of all the model parameters
- L2 norm: f(θ) = ||θ||2^2 is the sum of the squares of all the model parameters
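In Keras, either penalty can be attached to a layer through its kernel_regularizer argument; a minimal sketch is shown below, where the layer width and the value of λ are illustrative assumptions.

```python
from tensorflow.keras import layers, regularizers

lam = 0.01  # regularization parameter (lambda), an illustrative value

# L1 penalty: lambda * sum(|theta|);  L2 penalty: lambda * sum(theta^2)
l1_dense = layers.Dense(128, activation='relu',
                        kernel_regularizer=regularizers.l1(lam))
l2_dense = layers.Dense(128, activation='relu',
                        kernel_regularizer=regularizers.l2(lam))
```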
2. Dropout.
A dropout operation is performed by multiplying the weight matrix W^l with an α mask vector, as shown below.
If the layer has three units, the α vector will have shape (3, 1). Now, if the value of q (the probability of a 1) is 0.66, the α vector will have two 1s and one 0. Hence, α can be any one of the following three vectors: [1 1 0], [1 0 1] or [0 1 1].
One of these vectors is then chosen randomly for each mini-batch. Let's say that, in some mini-batch, the mask α = [1 1 0] is chosen. Hence, the new (generalized) weight matrix will be:
All the elements in the last column become zero. Thus, a few neurons (shown in the image below) are dropped at random, forcing the network to learn more robust features and reducing the training time for each epoch.
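A minimal NumPy sketch of this masking step, assuming a layer with three units and an illustrative 4*3 weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Weight matrix of a layer with three units (one column per unit); values are illustrative.
W = rng.normal(size=(4, 3)).round(2)

# Dropout mask with q = 2/3: two units kept, one dropped -- here alpha = [1, 1, 0].
alpha = np.array([1, 1, 0])

W_dropped = W * alpha  # broadcasting zeroes out the last column
print(W_dropped)
```

In practice, frameworks such as Keras apply the mask to the layer activations and rescale the kept units by 1/q during training (inverted dropout), but the effect of zeroing out a unit, as described above, is the same.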
3. Batch Normalization.
This technique allows each layer of the neural network to learn a little more independently of the previous layers. For example, in a feed-forward neural network:
h4 = σ(W4·h3 + b4) = σ(W4·σ(W3·σ(W2·σ(W1·x + b1) + b2) + b3) + b4)
h4 is a composite function of all the previous layer outputs (h1, h2, h3). Hence, when we update the weights of, say, W4, it affects the output h4, which in turn affects the gradient ∂L/∂W5. Ideally, the updates made to W5 should not be affected by the updates made to W4, and batch normalization helps reduce this dependence.
Thus, batch normalization is performed on the output H(l) of the layers for each batch. The output of the layer is normalized by the mean vector μ and the standard deviation vector σ̂ computed across the batch.
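A minimal NumPy sketch of the normalization step, with illustrative batch and layer sizes; the learnable scale and shift parameters that follow this step in the full operation are omitted.

```python
import numpy as np

# H: output of layer l for one mini-batch -- shape (batch_size, num_units); sizes are illustrative.
H = np.random.randn(64, 128)

mu = H.mean(axis=0)                      # mean vector computed across the batch
sigma_hat = H.std(axis=0)                # standard deviation vector computed across the batch
H_norm = (H - mu) / (sigma_hat + 1e-5)   # normalized layer output (epsilon added for stability)

# Keras wraps this step (plus learnable scale and shift parameters) in layers.BatchNormalization().
```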
With the above techniques in mind, we will now train our CNN on the CIFAR-10 dataset.
The CIFAR-10 dataset consists of 60,000 RGB images of size (32, 32, 3), split evenly across 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The dataset can be downloaded directly through the Keras API.
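For example, the dataset can be loaded as follows; the scaling and one-hot encoding shown here are a common preprocessing choice, assumed rather than taken from the original post.

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape, x_test.shape)   # (50000, 32, 32, 3) (10000, 32, 32, 3)

# Scale pixels to [0, 1] and one-hot encode the 10 labels.
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)
```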
The goal is to experiment with the hyperparameters and architectural choices mentioned above for better accuracy on the CIFAR-10 dataset, and to draw insights from the results:
- Adding and removing dropouts in convolutional layers
- Batch Normalization (BN)
- L2 regularisation
- Increasing the number of convolution layers
- Increasing the number of filters in certain layers
Approach:
To start with, we use a simple baseline model: the dataset is split into train and test sets, the number of classes is set to 10, and the model is expected to run for 100 epochs. A simple sequential network is built with two convolution layers of 32 feature maps each, each followed by an activation layer and a pooling layer.
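A minimal Keras sketch of such a baseline is given below. The kernel size, pooling size, FC width, optimizer and batch size are illustrative assumptions, not necessarily the author's exact configuration.

```python
from tensorflow.keras import layers, models

num_classes = 10

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    # Two convolution blocks with 32 feature maps each,
    # each followed by an activation layer and a pooling layer.
    layers.Conv2D(32, (3, 3), padding='same'),
    layers.Activation('relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), padding='same'),
    layers.Activation('relu'),
    layers.MaxPooling2D((2, 2)),
    # Classification head.
    layers.Flatten(),
    layers.Dense(512, activation='relu'),  # illustrative FC width
    layers.Dense(num_classes, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# model.fit(x_train, y_train, epochs=100, batch_size=64,
#           validation_data=(x_test, y_test))
```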
1. Dropouts after Conv and FC layers
A dropout of 0.25 is applied after the convolution layers and a dropout of 0.5 after the FC layer. A training accuracy of 84% and a validation accuracy of 79% are achieved.
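A sketch of where these dropouts could sit in the baseline above; the exact placement relative to the pooling layers is an assumption.

```python
from tensorflow.keras import layers

conv_block = [
    layers.Conv2D(32, (3, 3), padding='same'),
    layers.Activation('relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),   # dropout after the convolution block
]

fc_head = [
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),    # dropout after the FC layer
    layers.Dense(10, activation='softmax'),
]
```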
2. Remove the dropouts after the convolutional layers (but retain them in the FC layer) and use batch normalization (BN) after every convolutional layer.
Training accuracy is ~98% and validation accuracy ~79%. This is now a case of overfitting, since we have removed the dropouts. With such a high training accuracy, we can say that the model has learned the training data.
3. Use dropouts after Conv and FC layers, use BN:
- Training accuracy ~89%, validation accuracy ~82%
There is a significant improvement in validation accuracy, with a reduced gap between training and validation. We can say that our model is able to generalize well.
4. Remove dropouts from Conv layers, use L2 + dropouts in FC, use BN:
- Training accuracy ~94%, validation accuracy ~76%.
A significant gap between training and validation accuracy remains. L2 regularization only tries to keep the redundant weights down, but it is not as effective as using the dropouts alone.
5. Dropouts after Conv layer, L2 in FC, use BN after convolutional layer
Train accuracy ~86%, validation accuracy ~83%
The gap has reduced and the model is not overfitting, but the model needs to be more complex to classify the images correctly. Hence, we shall add more layers as we go forward.
6. Add a new convolutional layer to the network.
Along with regularization and dropout, a new convolution layer is added to the network.
Train accuracy ~89%, validation accuracy ~84%
Though training and validation accuracy have both increased, adding an extra layer increases the computation time and the resources required.
7. Adding feature maps.
Add more feature maps to the Conv layers: from 32 to 64 and from 64 to 128 (a brief Keras sketch follows the list below).
Instead of adding an extra layer, here we add more feature maps to the existing convolutional network. The choice between the two approaches is situational:
- Add an extra layer when you feel your network needs more abstraction.
- Add more feature maps when the existing network is not able to grasp the existing features of an image (such as color and texture) well.
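A minimal sketch of the widened convolution layers, with the filter counts increased from 32 to 64 and from 64 to 128; the rest of the architecture is assumed unchanged.

```python
from tensorflow.keras import layers

# Same architecture as before, but with wider convolution layers.
wider_conv_layers = [
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
]
```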
Train accuracy ~92%, validation accuracy ~84%
Though the accuracy has improved, the gap between train and validation still reflects overfitting.
On adding more feature maps, the model tends to overfit (compared to adding a new convolutional layer). This shows that the task requires learning to extract more (new) abstract features, by adding a more complex dense network, rather than trying to extract more of the same features.
Conclusion:
The performance of CNNs depends heavily on multiple hyperparameters: the number of layers, the number of feature maps in each layer, the use of dropouts, batch normalization, etc. Thus, it is advisable to first fine-tune your model hyperparameters by conducting lots of experiments. Once the right set of hyperparameters is found, the model should be trained for a larger number of epochs.
The source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.
Credit: BecomingHuman By: Sneha Bhatt