As one of the major building blocks of a Convolutional Network, the Conv layer handles most of the computational heavy lifting (Koenker et al, 2001).
The figure below shows an example input volume in red (e.g. a 32x32x3 CIFAR-10 image) along with an example volume of neurons in the first Conv layer. Each neuron in the convolutional layer is connected only to a local spatial region of the input volume, but to the full depth (i.e. all color channels). On the right, the neurons still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is restricted to be local spatially.
So far we have discussed how neurons in the Conv layer are connected, but not how they are arranged. Three hyperparameters control the size of the output volume: the depth, the stride, and the zero-padding:
1. The depth corresponds to the number of filters, each learning to look for something different in the input. For example, if the first Conv layer takes the raw image as input, then different neurons along the depth may activate in the presence of various oriented edges or shades of color. A set of neurons that all look at the same region of the input is referred to as a depth column.
2. The stride specifies how the filter slides. With a stride of 1, the filters move one pixel at a time. With a stride of 2, they jump two pixels at a time (strides of 3 or more are rare in practice).
3. Zero-padding controls the spatial size of the output volume (for example, to exactly preserve the spatial size of the input volume so that the input and output width and height are equal).
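As a minimal sketch of how these hyperparameters are used in practice (PyTorch is an assumption of this example; the text itself is framework-agnostic), all three map directly onto the arguments of a 2D convolution layer:

```python
import torch
import torch.nn as nn

# Hypothetical first Conv layer for a 32x32x3 CIFAR-10 image.
# depth        -> out_channels (number of filters, here 12)
# stride       -> stride (filters move 1 pixel at a time)
# zero-padding -> padding (1 pixel of zeros on each border)
conv = nn.Conv2d(in_channels=3, out_channels=12,
                 kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 32, 32)   # one CIFAR-10-sized input volume
y = conv(x)
print(y.shape)                  # torch.Size([1, 12, 32, 32]) -- spatial size preserved
```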
The spatial size of the output volume can be calculated as a function of the following:
● the receptive field size of the Conv Layer neurons (F),
● the input volume size (W),
● the stride with which they are applied (S), and
● the amount of zero-padding used on the border (P).
The formula for calculating how many neurons “fit” is (W − F + 2P)/S + 1.
Let’s use a graphical example to see how this formula works:
In the figure above, one spatial dimension (x-axis) is shown, with one neuron of receptive field size F = 3, input size W = 5, and zero-padding P = 1. On the left, the neuron strides across the input with stride S = 1, giving an output of size (5 − 3 + 2)/1 + 1 = 5. On the right, the neuron uses a stride of S = 2, giving an output of size (5 − 3 + 2)/2 + 1 = 3.
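A quick way to sanity-check this arithmetic is a small helper function (the function name is hypothetical, not from the text):

```python
def conv_output_size(W, F, S, P):
    """Number of neurons that 'fit' along one spatial dimension:
    (W - F + 2P)/S + 1."""
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=5, F=3, S=1, P=1))  # 5, as in the left example
print(conv_output_size(W=5, F=3, S=2, P=1))  # 3, as in the right example
```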
The pooling layer helps reduce the number of parameters in the network and thus helps avoid overfitting. It operates independently on every depth slice of the input and resizes it spatially by means of the MAX operation. The most common form of pooling layer uses filters of size 2×2 with a stride of 2, downsampling the input by 2 along both height and width and discarding 75% of the activations. Every MAX operation here takes a max over 4 numbers (a 2×2 region in some depth slice), while the depth dimension remains unchanged.
As can be seen in the figure above, the pooling layer downsamples the volume independently in each depth slice of the input volume. On the left, an input volume of size [224x224x64] is pooled with filter size 2 and stride 2 into an output volume of size [112x112x64]; the depth is unchanged. On the right, the downsampling operation is max, hence max pooling, shown here with a stride of 2. In other words, each max is taken over 4 numbers (a little 2×2 square).
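As a sketch of the [224x224x64] example from the figure (again assuming PyTorch):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # each max is taken over a 2x2 region

x = torch.randn(1, 64, 224, 224)  # input volume [224x224x64]
y = pool(x)
print(y.shape)  # torch.Size([1, 64, 112, 112]) -- depth unchanged, width/height halved
```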
Not everyone likes the pooling operation, and often, for simplicity, the pooling layer is discarded in favor of an architecture made up of repeated CONV layers; a larger stride in a CONV layer is then used to reduce the size of the representation (Eliasmith, 2013). Pooling layers are also often dropped when training good generative models, such as generative adversarial networks (GANs). Presumably, future architectures will feature very few or no pooling layers (Eliasmith, 2013).
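A minimal sketch of the substitution this paragraph describes, under the assumption of PyTorch and illustrative channel sizes: a stride-2 convolution can take over the downsampling role of the pooling layer.

```python
import torch.nn as nn

# Conventional block: convolution followed by pooling for downsampling.
with_pool = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# All-convolutional alternative: a stride-2 convolution downsamples directly.
all_conv = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)
```

Both blocks halve the spatial size, but the second does so without a pooling layer.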
In short, there are three layer types in a CNN: CONV, POOL (assumed to be max pooling unless stated otherwise) and FC (short for fully-connected). The RELU (Rectified Linear Unit) activation function, which applies an elementwise non-linearity, can also be counted as a layer.
Before looking at how all of these layers are stacked together to form an entire CNN, let's discuss the Inception network, an important milestone in the development of CNN classifiers. Prior to its inception (pun intended), most popular CNNs simply stacked convolution layers deeper and deeper, hoping for better performance. The Inception network, on the other hand, was complex: several tricks were used to push performance in terms of both speed and accuracy.
By fall 2014, deep learning models were becoming extremely useful in categorizing the content of images and video frames. Given the usefulness of these techniques, internet giants like Google were very interested in efficient, large-scale deployments of architectures on their server farms. The challenge was to reduce the computational burden of deep neural nets while retaining state-of-the-art performance.
A team at Google came up with the Inception module, which at first glance is basically a parallel combination of 1×1, 3×3, and 5×5 convolutional filters. The great insight of the Inception module, however, was the use of 1×1 convolutional blocks to reduce the number of features before the expensive parallel blocks. This is commonly referred to as a “bottleneck”.
The bottleneck layer of Inception aimed to decrease the number of features, and thus the number of operations, at each layer, so that inference time could be kept low. Before passing data to the expensive convolution modules, the number of features was reduced by, say, a factor of 4. This led to large savings in computational cost, and to the success of the architecture.
Let’s examine this in detail.
For instance, suppose there are 256 features coming in and 256 coming out, and that the Inception layer performs only 3×3 convolutions. That amounts to 256×256 × 3×3 ≈ 589,000 multiply-accumulate operations. Instead, one can reduce the number of features that have to be convolved to 64 (256/4): first a 1×1 convolution from 256 down to 64 features, then the 3×3 convolution on the 64 features in each Inception branch, and then another 1×1 convolution from 64 back up to 256 features. The operation counts are:
· 256×64 × 1×1 = 16,384
· 64×64 × 3×3 = 36,864
· 64×256 × 1×1 = 16,384
This comes to a total of about 70,000 operations, versus the almost 600,000 we had before (almost 10x fewer operations).
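A hedged sketch of this bottleneck pattern (PyTorch, with a module name and channel counts chosen to match the arithmetic above, not taken from the text):

```python
import torch
import torch.nn as nn

class Bottleneck3x3(nn.Module):
    """1x1 reduce -> 3x3 convolve -> 1x1 expand, as in the counting above."""
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)          # 256 -> 64
        self.conv = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1)  # 64 -> 64
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1)          # 64 -> 256

    def forward(self, x):
        return self.expand(self.conv(self.reduce(x)))

block = Bottleneck3x3()
y = block(torch.randn(1, 256, 28, 28))
print(y.shape)  # torch.Size([1, 256, 28, 28])

# Multiply-accumulates per output position, matching the counts above:
print(256*64*1*1 + 64*64*3*3 + 64*256*1*1)  # 69632, versus 256*256*3*3 = 589824
```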
Although we are doing fewer operations, we are not losing generality in this layer. In fact, bottleneck layers have been shown to perform at the state of the art on the ImageNet dataset, for example, and are also used in later architectures such as ResNet.
The reason for this success is that the input features are correlated, and the redundancy can therefore be removed by combining them appropriately with the 1×1 convolutions. After convolution with a smaller number of features, they can be expanded again into meaningful combinations for the next layer.
In December 2015, Google released a new version of the Inception modules and the corresponding architecture. The accompanying article better explains the original GoogLeNet architecture, giving a lot more detail on the design choices. The original ideas are:
· maximize information flow into the network by carefully constructing networks that balance depth and width, increasing the feature maps before each pooling;
· when depth is increased, systematically increase the number of features, or width, of the layer as well;
· use width increases at each layer to enrich the combination of features before the next layer;
· use only 3×3 convolutions where possible, since 5×5 and 7×7 filters can be decomposed into multiple 3×3 ones (see the sketch below).
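For the last point, a small sketch: two stacked 3×3 convolutions see the same 5×5 region of the input as a single 5×5 convolution, with fewer weights (the channel count here is an illustrative assumption):

```python
import torch.nn as nn

C = 64
single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)  # 64*64*5*5 = 102,400 weights
stacked_3x3 = nn.Sequential(                            # same 5x5 receptive field
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)                                                       # 2 * 64*64*3*3 = 73,728 weights
```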
Filters can also be decomposed by flattened convolutions into more complex modules. Inception modules can also decrease the size of the data by providing pooling while performing the Inception computation; this is basically identical to performing a strided convolution in parallel with a simple pooling layer. Inception still uses a pooling layer plus softmax as the final classifier.
The most common pattern for a CNN architecture consists of a few CONV-RELU layers followed by POOL layers, repeated until the image has been reduced spatially to a small size. At some point, it is common to transition to fully-connected layers; the last fully-connected layer holds the output. In other words, the most common ConvNet architecture follows the pattern (Eliasmith, 2013): INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC, where POOL? denotes an optional pooling layer.
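A minimal sketch of this pattern for CIFAR-10-sized inputs, with N = 2, M = 2, K = 1 (layer sizes and the class count are assumptions for illustration):

```python
import torch
import torch.nn as nn

# INPUT -> [[CONV -> RELU]*2 -> POOL]*2 -> [FC -> RELU] -> FC
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),                           # the last FC layer holds the output
)

print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```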
Given an input value x, the `ReLU` layer computes the output as x if x > 0 and negative_slope * x if x <= 0. When the negative slope parameter is not set, this is equivalent to the standard ReLU function, taking max(x, 0). It also supports in-place computation, meaning that the bottom and top blob can be the same, to conserve memory.
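As a sketch of the same computation in PyTorch (an assumption; the description above matches the blob-based convention of Caffe-style frameworks):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8)

relu = nn.ReLU()                           # max(x, 0)
leaky = nn.LeakyReLU(negative_slope=0.01)  # x if x > 0 else 0.01 * x
relu_inplace = nn.ReLU(inplace=True)       # overwrites its input to conserve memory

assert torch.equal(relu(x), torch.clamp(x, min=0))
```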
Dealing with this high-frequency noise has been one of the primary challenges and overarching threads of feature visualization research. If you want useful visualizations, you need to impose a more natural structure using some kind of prior, regularizer, or constraint (Hochreiter et al, 2001).
In fact, if you look at most notable papers on feature visualization, one of their main points will usually be an approach to regularization. Researchers have tried a lot of different things!
We can think of all of these approaches as living on a spectrum, based on how strongly they regularize the model (Hochreiter et al, 2001). On one extreme, if we don’t regularize at all, we end up with adversarial examples. On the opposite end, we search over examples in our dataset and run into all the limitations we discussed earlier. In the middle we have three main families of regularization options.