Part-I of this blog series concluded by forming a compressed matrix for each of the features of the object to be identified. In essence, the matrix so formed removed the bias stemming from magnification/reduction, rotation, and distortion of the image. But the job is only half done: our algorithm has still not met its core objective of classifying the objects in a given image. In this blog, we will progress to the subsequent step and build a neural network model that learns by combining the various input features and successfully identifies the objects in an image with higher accuracy (refer Pic-1).
Our compressed matrix from the previous step (Pic-2)
Each number in the above matrix (i.e. 0.33, 1.0, 55, etc.) is a neuron and forms the first layer of our neural network. The output layer will be the various objects that the model has been trained to learn and classify.
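To make this concrete, here is a minimal sketch (with made-up matrix values) of how a compressed feature matrix from Part-I could be flattened so that each number becomes one neuron of the input layer:

```python
import numpy as np

# Hypothetical compressed feature matrix from Part-I (values are illustrative).
compressed = np.array([
    [0.33, 1.00, 0.55],
    [0.10, 0.72, 0.05],
])

# Each number in the matrix becomes one neuron in the input layer,
# so we flatten the 2-D matrix into a 1-D vector of neurons.
input_layer = compressed.flatten()
print(input_layer.shape)  # (6,)
```

The output layer would then be one neuron per object class the model is trained to recognise.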
For the time being, the intermediate hidden layers can be regarded as a black box that will carry us towards our goal (Pic-3).
Do we need a black box? Can we directly code a function that maps a given combination of input numbers to a specific output? That's a typical programmer's mindset, but it doesn't work in the world of machine learning. To clarify, consider the below image of a set of animal eyes (Pic-4): can you identify the animals based on it?
A few of us may manage it with some difficulty, but for higher speed and greater accuracy, our brain doesn't depend solely on one feature; rather, it combines a set of features to form a visual map. A neural network replicates exactly this model: it associates various features to classify an image, a technique that delivers not only higher speed but also improved precision.
Our task now simplifies to unraveling the mystery around the so-called "black box". Technically, this box is formed by layers of neurons (hence the name neural network), with each layer connected to the previous one by a set of weights; see the sample neural network in Pic-5.
Each of the connecting arrows has a "weight" associated with it. To start with, we assign a random number as each weight, and we calculate the value of a neuron in the hidden layer as the weighted sum of all its incoming connections (forward propagation); refer to Pic-6.
In Pic-6, we oversimplified neuron "y" to have only one input, so its value is the product of W and X0. If instead we had two inputs X0 and X1 with weights W1 and W2, we would compute the value of Y as the sum of the products X0·W1 and X1·W2. Extending this technique, we can keep propagating forward, forming more hidden layers by combining various neurons from the previous layer. If we take a step back and observe, this is essentially a combination of all possible input features leading to an output image. TensorFlow, an open-source library from Google, offers a no-strings-attached Playground where you can try various combinations for this black box (increasing/reducing input features, hidden layers, weights, adding noise, etc.).
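The two-input case described above can be sketched in a few lines (the input and weight values below are illustrative, standing in for X0, X1, W1, and W2 from Pic-6):

```python
import numpy as np

# Two input neurons and their randomly initialised weights, as in Pic-6.
x = np.array([0.33, 1.0])    # inputs X0, X1 (illustrative values)
w = np.array([0.4, -0.2])    # weights W1, W2 (random starting point)

# Forward propagation for a single neuron:
# the weighted sum of its inputs, Y = X0*W1 + X1*W2.
y = np.dot(x, w)
print(y)
```

Stacking such weighted sums layer after layer is all forward propagation does; real networks additionally pass each sum through an activation function, which is omitted here for simplicity.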
As we set up the hidden layers and assign the weights, the model attempts to classify an input image. The first few iterations will be futile, resulting in inaccurate classifications; this is part of the learning curve. The weights are continuously adjusted and tuned until the model comes closer to making better predictions. Though it sounds easy in theory, mathematically this step takes the most significant effort in the whole program.
Let's go one level deeper to understand it. Following our previous example of an elephant image as input, if the model classifies the output as zebra with 80% confidence and elephant with 50% confidence, then we certainly know it's incorrect and the weights need to be tuned. But in which direction do we tune a weight? Should we increase or decrease it? And how many weights should we keep adjusting? Let's resolve these one at a time. To answer the first question, of which direction to change the weight, we embrace a concept called gradient descent (Pic-7).
In the above picture, point X2 is the only location where the slope of the curve is zero; in simple terms, at X2 we get the most optimal result, while at all other points we have either a positive or a negative slope. Determining X2 is therefore the goal of gradient descent. Continuing with our previous example, if changing weight W1 raises the elephant's confidence from 50% to something greater than 50%, then we are traveling in the right direction. We then adopt a trial-and-error method to reach the point of zero gradient: we confidently know the direction, but we have no way of knowing the exact location of X2. So we keep iterating. If we take huge steps, we risk overshooting the zero gradient and jumping into a positive-gradient zone; on the other hand, if our steps are too small, we burn a lot of computationally intensive resources (budget overshoot) and slow down the whole program. Technically, this step size is called the "learning rate", and it is a hyper-parameter.
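A minimal sketch of this idea, using a toy curve f(x) = (x − 2)² whose slope is zero at x = 2 (playing the role of X2 in Pic-7):

```python
# Gradient descent on f(x) = (x - 2)**2, which has zero slope at x = 2.

def grad(x):
    # Derivative of (x - 2)**2: positive to the right of 2, negative to the left.
    return 2 * (x - 2)

x = 0.0              # random starting point
learning_rate = 0.1  # the hyper-parameter discussed above
for _ in range(100):
    x -= learning_rate * grad(x)  # step against the slope

print(round(x, 4))  # converges close to 2.0
```

Try setting `learning_rate = 1.1` to see the overshooting problem described above: each step jumps past the zero-gradient point and the iterates diverge instead of settling at 2.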
For a CNN beginner, this is a good start on gradient descent. As we embark on larger ML programs, the complexity increases because a curve may have more than one point of zero gradient (refer Pic-8), which further convolutes the whole exercise, and we adopt other complementing techniques such as mini-batch and stochastic gradient descent (more on these later).
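As a preview of the mini-batch variant mentioned above, here is a hedged sketch on a made-up toy problem (fitting y = w·x with true w = 3): instead of computing the gradient over all the data at every step, each update uses only a small random batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a 1-D linear fit y = w * x, with true w = 3 (illustrative).
X = rng.random(1000)
Y = 3.0 * X

w, learning_rate, batch_size = 0.0, 0.1, 32
for _ in range(200):
    # Mini-batch gradient descent: sample a small random subset of the data
    # and compute the gradient of the mean squared error on that batch only.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], Y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)
    w -= learning_rate * grad

print(round(w, 2))  # approaches 3.0
```

The batch gradient is a noisy estimate of the full-data gradient, which is exactly what helps these methods escape shallow zero-gradient traps on more complicated curves.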