Inspired by neuroscience, it is no longer a mystery to ML engineers how a perceptron takes in inputs, applies a linear combination, and produces an output of 1 if that combination is greater than some threshold value, and 0 otherwise. In the language of mathematics, it looks like this:
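$$\text{output} = \begin{cases} 0 & \text{if } \sum_i w_i x_i \le \text{threshold} \\ 1 & \text{if } \sum_i w_i x_i > \text{threshold} \end{cases}$$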
In this formula, the Greek letter sigma (∑) denotes summation, while the index i iterates over the input (x) and weight (w) pairings.
For the sake of simplicity, let’s replace the threshold with what’s known as the neuron’s bias. Now we can rewrite the equation as:
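$$\text{output} = \begin{cases} 0 & \text{if } \sum_i w_i x_i + b \le 0 \\ 1 & \text{if } \sum_i w_i x_i + b > 0 \end{cases}$$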
In this case, the bias b can be defined as −threshold. In other words, the bias denotes how easy it is to get the neuron to output a 1: a large positive bias means it is easy for the neuron to output a 1, while a large negative bias makes it difficult.
There are a couple of reasons to make this small change. We randomly assign numbers to the weights (we don't know the right ones beforehand), and as the neural network trains, it makes incremental changes to those weights to produce more accurate outputs.
A function that transforms the values, or states the conditions for the decision of the output neuron, is known as an activation function. The one we have been using is just one of various activation functions, called the Heaviside step function.
Other activation functions include the sigmoid, tanh, and softmax functions, and they each have their own purpose.
Even when dealing with absolutes (1s and 0s; yeses and nos), it can be beneficial to have the output give an intermediate value. It is a bit like answering "maybe" when asked a yes-or-no question you have no idea about, instead of guessing. This is essentially the advantage of the sigmoid function over the Heaviside step function.
Like the Heaviside step function, the sigmoid takes values between 0 and 1. Yet this time there is no step; it has been smoothed out into a continuous curve. Thus, the output can be interpreted as the probability of a success (a 1), or a yes. This feature is particularly important for the network's learning capabilities.
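To make the contrast concrete, here is a minimal sketch in Python/NumPy (the function names are ours) of the two activation functions applied to the same weighted sums:

```python
import numpy as np

def heaviside(z):
    # Hard threshold: output jumps abruptly from 0 to 1 as z crosses zero.
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    # Smooth alternative: output moves gradually from 0 to 1,
    # so it can be read as the probability of a "yes".
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])  # example weighted sums
print(heaviside(z))  # [0. 0. 0. 1. 1.]
print(sigmoid(z))    # approximately [0.12 0.48 0.5 0.52 0.88]
```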
So far, we have explored the architecture of the perceptron, the simplest neural network model, and seen two activation functions: the Heaviside step function and the sigmoid function.
Yet we can make these networks more complicated by stacking neurons into layers: any layers between the input layer and the output layer are known as hidden layers. The more layers there are, the more nuanced the decision-making can become.
The networks often go by different names: deep feedforward networks, feedforward neural networks, or multi-layer perceptrons (MLP). They are called feedforward networks because the information flows in one general (forward) direction, where mathematical functions are applied at each stage. In fact, they are called ‘networks’ because of this chain of functions (the output of the first layer function is the input to the second layer and that output is the input to the third layer and so on). The length of that chain gives the depth of the model, and this is actually where the term ‘deep’ in deep learning comes from.
Adding hidden layers can allow the neural network to make more complex decisions, but more on that, and on how neural networks learn through a process known as backpropagation, later.
Can you imagine what a daunting task thinking would be if we had to start our thinking process from scratch every time? While reading a book, for instance, we process each word based on our understanding of the previous words rather than starting from scratch again: our thoughts are persistent.
From the perspective of a traditional neural network, it is not clear how to use reasoning about previous events in a film to inform later scenes (Koenker et al, 2001).
Another way to think of this is that such a network would contain "memory" nodes that depend on all previous memory. By contrast, for a network with 3 layers (input, hidden, output) and 2 activation functions (f and g), a standard feed-forward network can be represented as (Fig. 1):
hidden = f(input)
output = g(hidden)
These neural networks receive an input (a single vector) and transform it through a series of hidden layers. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where there is no connection among neurons within a single layer. The last fully-connected layer is called the "output layer", and in classification settings it represents the class scores (Koenker et al, 2001).
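As a minimal sketch of such a network in NumPy (our own illustration, with sigmoid activations standing in for f and g, and randomly initialized weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes: 4 inputs -> 3 hidden neurons -> 2 output neurons.
# Weights start out random; training would adjust them incrementally.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

x = rng.normal(size=4)              # a single input vector
hidden = sigmoid(W1 @ x + b1)       # hidden = f(input)
output = sigmoid(W2 @ hidden + b2)  # output = g(hidden)
print(output)
```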
Convolutional Networks
In the past, neural networks for complex tasks were too inefficient to be used in practice (Hopfield, 1982). A more practical method was devised by Yann LeCun and his team of researchers at the AT&T labs, who applied a convolutional neural network to a handwritten digit recognition task (Hopfield, 1982). Their idea was that instead of having an n-number of weights (nodes in the layers after the input layer) process the image fully, one could slide a small set of shared weights over the image to discover local features. Not only does this drastically reduce the computational requirements, but it also makes the weights generalizable (i.e. once a feature is learned, it can be detected in a different location in the same image and also in different images) (Koenker et al, 2001).
We can think of images as two-dimensional functions. In the case of image transformations, convolution works by convolving the image function with a very small, local function called a "kernel."
The kernel slides to every position of the image and computes a new pixel as a weighted sum of the pixels it floats over.
For example, an image can be blurred by taking the average of a 3×3 box of pixels: the kernel takes the value 1/9 on each pixel in the box. A kernel can also detect edges by taking the values −1 and 1 on two side-by-side pixels and zero everywhere else: where adjacent pixels are similar, this returns approximately zero, while at edges, adjacent pixels are very different in the direction perpendicular to the edge.
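Here is a small NumPy sketch of this sliding-window computation, using the 3×3 averaging kernel as an example (the helper name is ours):

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over every valid position and compute a new pixel
    # as the weighted sum of the pixels it floats over. (CNN libraries
    # actually compute this cross-correlation, i.e. without flipping the
    # kernel; for a symmetric blur kernel the two coincide.)
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # a toy 5x5 "image"
blur = np.full((3, 3), 1.0 / 9.0)                 # 3x3 averaging (blur) kernel
print(convolve2d(image, blur))                    # blurred 3x3 output
```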
So, what is the relationship of convolution to convolutional neural networks?
Consider a 1-dimensional convolutional layer with inputs {x_n} and outputs {y_n}:

$$y_n = A(x_n, x_{n+1}, \dots)$$
Generally, A would be multiple neurons. But suppose it is a single neuron for a moment. Recall that a typical neuron in a neural network is described by:
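$$\sigma(w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + b)$$

where σ is the activation function (e.g. the sigmoid) and b is the bias.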
Here, x_0, x_1, … are the inputs, while the weights w_0, w_1, … describe how the neuron connects to its inputs.
If a weight is negative, the corresponding input inhibits the neuron from firing; if it is positive, it encourages the neuron to fire.
It is this wiring of neurons, describing all the weights and which of them are identical, that convolution describes.
The aim is to have a weight matrix, W, such that the whole layer can be written as:
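$$y = \sigma(Wx + b)$$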
As seen in this formula, each row of the matrix describes the weights connecting one neuron to its inputs.
As there are multiple copies of the same neuron, many weights appear in multiple positions, which corresponds to the equations:
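$$y_0 = A(x_0, x_1) = \sigma(w_0 x_0 + w_1 x_1 + b)$$
$$y_1 = A(x_1, x_2) = \sigma(w_0 x_1 + w_1 x_2 + b)$$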
So while, normally, a weight matrix connects every input to every neuron with different weights, in a convolutional layer the same weights appear in various positions; and because each neuron connects to only a few of the possible inputs, the matrix contains lots of zeros:
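$$W = \begin{pmatrix} w_0 & w_1 & 0 & 0 & \cdots \\ 0 & w_0 & w_1 & 0 & \cdots \\ 0 & 0 & w_0 & w_1 & \cdots \\ \vdots & & & & \ddots \end{pmatrix}$$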
Multiplying by the above matrix is equivalent to convolving with (…, 0, w_1, w_0, 0, …). Yet, what about two-dimensional convolutional layers?
As seen in Figure 3, the wiring of a two dimensional convolutional layer corresponds to a two-dimensional convolution.
Recall our example above of using a convolution to detect edges in an image by sliding a kernel around and applying it to every patch. In the same way, a convolutional layer applies a neuron to every patch of the image.
One of the biggest advantages of a CNN is that it treats its input explicitly as an image, with neurons arranged in 3 dimensions: width, height, and depth. The neurons in a layer are connected only to a small region of the layer before it, rather than being fully connected to all of its neurons. The following figure provides a visualization:
In Fig. 5, on the left-hand side there is a regular 3-layer neural network. On the right-hand side, a ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. The red input layer holds the image, so its width and height are the dimensions of the image, while the depth is 3 (the red, green, and blue channels).
The main types of layers used to build ConvNet architectures are:
● Convolutional Layer,
● Pooling Layer, and
● Fully-Connected Layer.
Below is an example of a simple ConvNet for CIFAR-10 classification with the architecture [INPUT — CONV — RELU — POOL — FC]; a code sketch follows the list. In more detail (Eliasmith, 2013):
● INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.
● CONV layer will compute the output of neurons connected to local regions of the input, resulting in a volume such as [32x32x12] if we decide to use 12 filters.
● RELU layer applies the max(0,x) thresholding at zero. The volume remains unchanged ([32x32x12]).
● POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
● FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score among the 10 categories of CIFAR-10.
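As a concrete sketch (our own, not from the original article), this pipeline could be written in Python with Keras, assuming a 3×3 filter size and 2×2 pooling:

```python
import tensorflow as tf
from tensorflow.keras import layers

# A minimal sketch of the [INPUT - CONV - RELU - POOL - FC] pipeline.
# Filter size (3x3) and pool size (2x2) are our assumptions.
model = tf.keras.Sequential([
    layers.Conv2D(12, (3, 3), padding="same",
                  input_shape=(32, 32, 3)),  # CONV: [32x32x12], 12 filters
    layers.ReLU(),                           # RELU: max(0, x), volume unchanged
    layers.MaxPooling2D((2, 2)),             # POOL: downsamples to [16x16x12]
    layers.Flatten(),
    layers.Dense(10),                        # FC: 10 class scores (CIFAR-10)
])
model.summary()
```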
In this way, the CNN transforms the original image layer by layer, from the raw pixel values to the final class scores. Note that some layers contain parameters while others don't. In particular, the transformations performed by the CONV/FC layers are functions not only of the activations in the input volume, but also of the parameters (the neurons' weights and biases) (Koenker et al, 2001).
In Figure 5, the initial volume stores the raw image pixels (left), while the last volume stores the class scores (right). Each column refers to a volume of activations along the processing path. Since it is difficult to visualize 3D volumes, each volume's slices are laid out in rows. The last-layer volume holds the scores for each class.
This may all sound a bit confusing, but then, who said neural networks would be easy? Once these basics are grasped in detail, though, it is not rocket science!
Credit: BecomingHuman By: The AI LAB