Now that I have introduced Marr’s representational framework for vision and Palmer’s model of visual perception, we can move on to one of the most fascinating biologically-inspired deep learning algorithms: the Convolutional Neural Network (ConvNet/CNN). But before getting into the CNN itself, I will first introduce the concept of the artificial neural network (ANN). I will skip the mathematical detail behind the architecture entirely and focus only on the basics one needs to know in order to understand what the convolutional neural network is actually doing (or at least I hope you will!). I will, however, explain the architecture and the math behind it in more depth in my next series (and I hope you stick around long enough to read it!).
Brain cells, or neurons (shown above), which constitute animal brains, consist of 3 main parts: the cell body; the dendrites (input unit); and the axon (output unit). Neurons communicate with each other via electrical events called ‘action potentials’ at a junction known as the synapse. Here at the synaptic cleft, neurotransmitters (signal input) from the presynaptic neuron are released and bind to the receptors of the postsynaptic neuron.
“The binding of neurotransmitters to the postsynaptic neuron’s receptors can either cause the postsynaptic neuron to fire an action potential or inhibit it from firing.”
You can think of an Artificial Neural Network (ANN) as a machine learning architecture that borrows its form loosely from the biological neural structure: the neurons of the first (input) layer spit out numerical values that are passed on to the next layer, where they can either excite or inhibit the neurons of the second (hidden) layer. The same process happens from the second layer onto the next. Since our super simple, vanilla example only has one hidden layer, the last layer spits out the output.
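To make this concrete, here is a minimal sketch of the forward pass through such a one-hidden-layer network in plain NumPy. The layer sizes, random weights, and choice of sigmoid activation are all illustrative assumptions, not anything from a trained model:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into (0, 1); a rough model of how strongly a neuron "fires"
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 4 input neurons, 3 hidden neurons, 2 output neurons
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input -> hidden weights
b1 = np.zeros(3)
W2 = rng.normal(size=(2, 3))   # hidden -> output weights
b2 = np.zeros(2)

x = rng.normal(size=4)         # one input example

hidden = sigmoid(W1 @ x + b1)  # hidden layer "excitation/inhibition"
output = sigmoid(W2 @ hidden + b2)
print(output)                  # the values the last layer "spits out"
```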
So are neural networks the same as the infamous ‘deep learning’? Not quite, but all deep learning architectures are based on the ANN. The key difference is depth: a feedforward network of this kind is also known as a Multilayer Perceptron (MLP), and a neural network is generally considered ‘deep’ when it has more than one hidden layer.
“However, do not confuse a deep neural network (DNN) with deep learning (DL).”
‘Deep Learning’ is a category of multilayer neural network-based machine learning in itself. Architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks (CNNs) all fall under the category of deep learning. Today, however, we will focus only on the CNN.
Coming back to the CNN, you may wonder why I chose this type of neural network in particular, and not other deep learning models, to talk about. The answer is that the network was designed specifically for image recognition tasks and has been used extensively in the field of computer vision for decades, be it self-driving cars, medical image analysis or object/face detection. One of the first convolutional neural networks, LeNet-5, was introduced in 1998 in a paper by LeCun, Bottou, Bengio and Haffner, where it was able to classify handwritten digits.
Also, let me remind you why it is so difficult for a computer to recognise or make a classification decision about an image.
Humans can spot and recognise patterns without having to re-learn the concept, and we can identify objects no matter what angle we view them from. A normal feed-forward neural network can’t do this. While we can easily see that the image above is a cat, what a computer actually sees is a numerical array where each value represents the colour intensity of a pixel. A convolutional neural network, however, can cope because it exhibits a degree of translation invariance: it recognises an object as the same object even when its appearance varies in some way. But how does it do that?
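As a quick illustration of ‘what the computer sees’, here is a small sketch that loads an image into a NumPy array of pixel intensities (the file name cat.jpg is just a placeholder):

```python
import numpy as np
from PIL import Image

# Hypothetical file; any RGB image will do
img = Image.open("cat.jpg")
pixels = np.asarray(img)

print(pixels.shape)   # e.g. (height, width, 3) for an RGB image
print(pixels[0, 0])   # colour intensities of the top-left pixel, e.g. [137  98  42]
```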
Just like with a typical neural network, we feed some labelled images into the model: the image pixel values are computed in the first layer of the network and then passed along the remaining layers until the final layer, where the network produces a predicted classification. Convolutional networks are trained through supervised learning and backpropagation: the predicted value is compared to the correct output, misclassified examples produce a large error, and that error is propagated backwards to adjust the parameters and yield a more accurate outcome. The network goes back and forth, correcting itself until a satisfying output is achieved.
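A minimal sketch of that supervised training loop, written here in PyTorch purely for illustration (the tiny placeholder model, the fake data and the hyperparameters are all assumptions):

```python
import torch
from torch import nn

# Fake data: 8 tiny greyscale "images" (1x8x8) with labels from 2 classes
images = torch.randn(8, 1, 8, 8)
labels = torch.randint(0, 2, (8,))

model = nn.Sequential(nn.Flatten(), nn.Linear(64, 2))  # placeholder classifier
loss_fn = nn.CrossEntropyLoss()                        # measures the "error gap"
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(20):                    # go back and forth until the error shrinks
    predictions = model(images)        # forward pass: predicted classification
    loss = loss_fn(predictions, labels)
    optimizer.zero_grad()
    loss.backward()                    # backpropagate the error
    optimizer.step()                   # adjust the parameters
```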
But you might ask… Hey! A deep neural network uses backpropagation as well, so why is the convolutional network special?
Good question. What’s special about the convolutional network lies in the way the connections between the neurons are structured, and in its unique hidden layer architecture, inspired by our very own visual data processing mechanism inside the visual cortex. And unlike plain deep neural networks, the layers in a CNN are organised in 3 dimensions: width, height, and depth (which can be represented as a 3-dimensional array). Here, I want you to remember one of the most important properties of the convolutional network, regardless of how many layers there are: the whole system of a CNN is composed of only two major parts:
- Feature Extraction:
During feature extraction, the network performs a series of convolutions (think of a convolution as combining two things together to give a certain output) and pooling operations in which features are detected. This is the part where features such as the cat’s ears, paws or fur colour are recognised.
- Classification:
Here, the fully connected layers serve as a classifier on top of these extracted features, assigning a probability that the object in the image is what the algorithm predicts it is.
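To make this two-part structure concrete, here is a minimal PyTorch sketch of a CNN with a convolution-plus-pooling feature extractor followed by a fully connected classifier. The layer sizes and filter counts are illustrative assumptions, not a reference architecture:

```python
import torch
from torch import nn

# Part 1: feature extraction (convolutions + pooling)
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 16 filters over an RGB image
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling: downsample, keep strongest responses
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# Part 2: classification (fully connected layers on top of the features)
classifier = nn.Sequential(
    nn.Flatten(),               # 3-D feature maps -> 1-D vector
    nn.Linear(32 * 8 * 8, 2),   # e.g. two classes: cat vs. not-cat
)

x = torch.randn(1, 3, 32, 32)     # one fake 32x32 RGB image
features = feature_extractor(x)   # shape: (1, 32, 8, 8)
logits = classifier(features)
probs = torch.softmax(logits, dim=1)  # probability per class
print(probs)
```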
The primary component of a CNN is the convolutional layer. Its job is to detect important features in the image pixels, using the same concept as how the simple cells detect simple features such as edges and lines before the information is processed further by the complex cells. Layers in the first section of the network (closer to the input) learn to detect simple features such as edges and colour gradients, whereas deeper layers combine simple features into more complex ones.
It might seem confusing since I’m avoiding all the maths behind the convolution, but all you need to know is that the system uses what is called a ‘filter’ (also known as a kernel) to detect these features. As shown below, a filter is basically a matrix which, when slid over the input image, performs a linear computation that can reduce the image size and extract the important features of the image. The convolved outputs are later stacked together into ‘feature maps’.
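Here is a bare-bones NumPy sketch of that sliding-filter computation (a ‘valid’ convolution with no padding or striding; the 5x5 toy image and the vertical-edge filter are made-up examples):

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image, taking a weighted sum at each position
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[0, 0, 1, 1, 1],    # a toy 5x5 image with a vertical edge
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1]], dtype=float)

edge_filter = np.array([[-1, 0, 1],   # responds strongly to vertical edges
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)

feature_map = convolve2d(image, edge_filter)
print(feature_map)   # 3x3 output: large values where the edge sits
```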
The network will “decide” which features are important through training. We perform multiple convolutions on an input, each using a different filter, which results in many distinct feature maps. We then stack all these feature maps together …and tada! We get the final output of the convolution layer. This 3-dimensional output is then flattened into a one-dimensional vector before being fed into the ‘Fully Connected Layer’. This is where things start to look like a normal neural network, with every input connected to every output by a learnable weight. The data then goes through a ‘probability conversion’ using the softmax function, which assigns a probability of the input image belonging to each category. During training, the high error cost resulting from misclassified outputs causes the model to backpropagate, adjusting the parameters to predict a more accurate outcome. The network goes back and forth, correcting itself until a satisfying output is achieved.
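For completeness, here is what that softmax ‘probability conversion’ looks like in NumPy (the three made-up scores are purely illustrative):

```python
import numpy as np

def softmax(scores):
    # Subtracting the max keeps the exponentials numerically stable
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, -1.0])   # raw outputs of the fully connected layer
probs = softmax(scores)
print(probs)        # approx. [0.705 0.259 0.035]
print(probs.sum())  # 1.0: a valid probability distribution over the categories
```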
Below is a summary of what I have covered so far about CNNs. By now you should know that:
- The first layer is responsible for detecting lines, edges, changes in brightness, and other simple features.
- The information is then passed onto the next layer, which combines simple features to build detectors that can identify simple shapes.
- The process continues in the next layer and the next, becoming more and more abstract with every layer. The deeper layers are able to extract higher-order features such as shapes or specific objects.
- The last layers of the network will integrate all of those complex features and produce classification predictions.
- The predicted value is compared to the correct output; wrongly classified examples produce a large error gap, which causes the learning process to backpropagate and adjust the parameters in order to give a more accurate outcome.
- The network goes back and forth, correcting itself until a satisfying output is achieved (i.e. the error is minimised).
Now that we know about the visual cortex and the convolutional network, the obvious question is: how similar are these two systems?
If you recall the first part of this series, I mentioned how visual data processing in the visual cortex begins with the detection of lines, edges and corners by the simple cells, and the analysis of other complex features (such as colour, shape and orientation) by the complex cells, which have also been shown to have more spatial invariance (their response does not depend on the exact position of the stimulus). Studies have concluded that complex cells achieve this by pooling over visual data from multiple simple cells, each with a different preferred location. And just like how these cells process visual information in the cortex, these two features, selectivity to specific features and increasing spatial invariance through feedforward connections, are what make artificial visual systems like CNNs so unique. Pooling in a CNN plays an analogous role, as the sketch below illustrates.
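As a rough analogy for that pooling-for-invariance idea, here is a minimal NumPy sketch of 2x2 max pooling, where each output keeps only the strongest response in its neighbourhood (the input values are made up):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # Keep only the strongest response in each 2x2 neighbourhood
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop odd edge rows/cols if any
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 0, 0],
               [2, 9, 0, 1],   # a strong response at position (1, 1)...
               [0, 0, 4, 2],
               [0, 1, 3, 5]], dtype=float)

print(max_pool_2x2(fm))
# [[9. 1.]
#  [1. 5.]]  -- the strong response survives even if it shifts slightly within its window
```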
Another interesting finding comes from researchers at the University of Tartu’s Computational Neuroscience Lab in Estonia, who found that the activations of deep convolutional neural networks actually resemble the gamma-band activity of the human visual cortex. The gamma band here refers to an electrical response frequency pattern in the brain (roughly 30 to 70 Hz). These findings align with previous research emphasising the importance of gamma activity for object recognition.
Many studies suggest that “core object recognition,” the ability to rapidly recognise objects despite wide variation in their appearance, is solved in the brain via a cascade of reflexive, largely feedforward computations that culminate in a powerful neuronal representation in the inferior temporal cortex. However, the algorithm underlying such complex processing remains poorly understood today.
— — — — — — — — — IMPORTANT NOTE: — — — — — — — —
Although the processing might seem similar in both systems, it is a fallacy to believe that this is exactly how biological neurons behave, or that they use the same feedforward-and-backpropagation model to learn. Early visual perception is often described as a largely feedforward process that proceeds from the retina to other specialised processing regions, and it is this feed-forward pathway that machine learning (ML) has attempted to imitate, albeit with no real idea of how the biological synaptic weights are established.
In my next story, which is going to be the last article of my Computer Vision series, I shall discuss all the current issues in computer vision and the challenges it faces, along with the role of neuroscience and cognitive science in the advancement of true artificial intelligence. Thank you for reading.
Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex https://www.nature.com/articles/s42003-018-0110-y
Neural Networks and Deep Learning http://neuralnetworksanddeeplearning.com/
How does the brain solve visual object recognition? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3306444/