In the first part of this series we discussed the concept of a neural network, as well as the math describing a single neuron. There are, however, many neurons in a single layer and many layers in the whole network, so we need to come up with a general equation describing a neural network.
The first thing our network needs to do is pass information forward through the layers. We already know how to do this for a single neuron:
The output of a neuron is the activation function applied to the weighted sum of the neuron’s inputs.
Now we can apply the same logic when we have 2 neurons in the second layer.
In this example every neuron of the first layer is connected to each neuron of the second layer; this type of network is called a fully connected network. Neuron Y1 is connected to neurons X1 and X2 with weights W11 and W12, and neuron Y2 is connected to neurons X1 and X2 with weights W21 and W22. In this notation the first index of the weight indicates the output neuron and the second index indicates the input neuron, so for example W12 is the weight on the connection from X2 to Y1. Now we can write the equations for Y1 and Y2:
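Following the indexing convention above (first index = output neuron, second index = input neuron), the two equations come out as:

```latex
Y_1 = X_1 W_{11} + X_2 W_{12}
\qquad
Y_2 = X_1 W_{21} + X_2 W_{22}
```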
Now this equation can be expressed using matrix multiplication.
If you are new to matrix multiplication and linear algebra and this confuses you, I highly recommend the 3blue1brown linear algebra series.
Now we can write the output of the first neuron as Y1 and the output of the second neuron as Y2. This gives us the following equation:
From this we can abstract the general rule for the output of the layer:
Now in this equation all variables are matrices and the multiplication sign represents matrix multiplication.
Using matrices in the equation allows us to write it in a simple form and makes it valid for any number of inputs and any number of neurons in the output.
When programming neural networks we also use matrix multiplication, as this allows us to parallelize the computation and use efficient hardware for it, like graphics cards.
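As a sketch, the layer equation maps directly onto a single matrix product in NumPy. The shapes (2 inputs, 2 outputs) match the example above; the weight and input values are made up for illustration:

```python
import numpy as np

# Weight matrix: row i holds the weights feeding output neuron Y_i,
# so W[0] = [W11, W12] and W[1] = [W21, W22] (illustrative values).
W = np.array([[0.5, -0.2],
              [0.1,  0.8]])

X = np.array([1.0, 2.0])  # outputs of the input layer: [X1, X2]

# Output of the whole layer in one matrix multiplication: Y = W * X
Y = W @ X

print(Y)  # Y1 = 1.0*0.5 + 2.0*(-0.2) = 0.1, Y2 = 1.0*0.1 + 2.0*0.8 = 1.7
```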
Now we have an equation for a single layer, but nothing stops us from taking the output of this layer and using it as an input to the next layer. This gives us the generic equation describing the output of each layer of a neural network. One more thing we need to add is the activation function. I will explain why we need activation functions in the next part of the series; for now you can think of it as a way to scale the output, so it doesn’t become too large or too insignificant.
With this equation, we can propagate the information through as many layers of the neural network as we want. But without any learning, a neural network is just a set of random matrix multiplications that doesn’t mean anything.
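The full forward pass is then just the single-layer rule applied repeatedly. A minimal sketch, using a sigmoid as the activation function; the layer sizes and random weights here are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    # Squashes any input into the range (0, 1), keeping outputs bounded.
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    # Apply Y = activation(W * X) layer by layer.
    for W in weights:
        x = sigmoid(W @ x)
    return x

rng = np.random.default_rng(0)
# A network with 3 inputs, a hidden layer of 4 neurons, and 2 outputs.
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]

output = forward(np.array([0.5, -1.0, 2.0]), weights)
print(output.shape)  # (2,)
```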
So how do we teach our neural network? First we need to calculate the error of the neural network and think about how to pass this error to all the layers.
To understand the error propagation algorithm we have to go back to an example with 2 neurons in the first layer and 1 neuron in the second layer.
Let’s assume the Y layer is the output layer of the network and the Y1 neuron should return some value. Now this value can differ from the expected value by quite a bit, so there is some error on the Y1 neuron. We can think of this error as the difference between the returned value and the expected value. We know the error on Y1, but we need to pass this error to the lower layers of the network because we want all the layers to learn, not only the Y layer. So how do we pass this error to X1 and X2? Well, a naive approach would be to split the Y1 error evenly: since there are 2 neurons in the X layer, we could say the error of both X1 and X2 is equal to the Y1 error divided by 2.
There is, however, a major problem with this approach: the neurons have different weights connected to them. If the weight connected to the X1 neuron is much larger than the weight connected to the X2 neuron, then the error on Y1 is much more influenced by X1, since Y1 = (X1 * W11 + X2 * W12). So if W11 is larger than W12, we should pass more of the Y1 error to the X1 neuron, since this is the neuron that contributes more to it.
Now that we have observed this, we can update our algorithm not to split the error evenly but to split it according to the ratio of the input neuron’s weight to all the weights coming into the output neuron.
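In symbols, using the weights from the example above, this proportional split reads:

```latex
E_{X_1} = E_{Y_1} \cdot \frac{W_{11}}{W_{11} + W_{12}},
\qquad
E_{X_2} = E_{Y_1} \cdot \frac{W_{12}}{W_{11} + W_{12}}
```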
Now we can go one step further and analyze an example where there is more than one neuron in the output layer.
In this example we see that, e.g., neuron X1 contributes not only to the error of Y1 but also to the error of Y2, and this error is still proportional to its weights. So, in the equation describing the error of X1, we need to have both the error of Y1 multiplied by the ratio of the weights coming to Y1 and the error of Y2 multiplied by the ratio of the weights coming to Y2.
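With X1 connected to Y1 through W11 and to Y2 through W21, the error of X1 combines both contributions:

```latex
E_{X_1} = E_{Y_1} \cdot \frac{W_{11}}{W_{11} + W_{12}}
        + E_{Y_2} \cdot \frac{W_{21}}{W_{21} + W_{22}}
```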
This equation can also be written in the form of matrix multiplication.
Now there is one more trick we can do to make this equation simpler without losing a lot of relevant information. The denominator of the weight ratio acts as a normalizing factor, so we don’t care that much about it, partly because in the final equation we will have other means of regulating the learning of the neural network.
There is also one more observation we can make. We can see that the weight matrix in this equation is quite similar to the matrix from the feed-forward algorithm, except that the rows and columns are switched. In linear algebra we call this the transpose of the matrix.
Since there is no need to use 2 different variables, we can just use the same variable from the feed-forward algorithm. This gives us the general equation of the back-propagation algorithm.
Note that in the feed-forward algorithm we were going from the first layer to the last, but in back-propagation we are going from the last layer of the network to the first, since to calculate the error in a given layer we need information about the error in the next layer.
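A sketch of this backward step in NumPy: dropping the normalizing denominator, the error on the previous layer is the transpose of the same weight matrix used in the forward pass, multiplied by the output error (the values here are illustrative):

```python
import numpy as np

# Same fully connected layer as in the forward pass: W[i][j] is the
# weight from input neuron j to output neuron i (illustrative values).
W = np.array([[0.5, -0.2],
              [0.1,  0.8]])

E_out = np.array([0.3, -0.1])  # error measured at the output layer: [E_Y1, E_Y2]

# Back-propagation: the error on the previous layer uses the transposed matrix.
E_in = W.T @ E_out

print(E_in)  # [E_X1, E_X2]
```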
Now that we know how to pass the information forward and the error backward, we can use the error at each layer to update the weights.
Now that we know what errors our neural network makes at each layer, we can finally start teaching our network to find the best solution to the problem.
But what is the best solution?
The error informs us about how wrong our solution is, so naturally the best solution would be the one where the error function is minimal.
The error function depends on the weights of the network, so we want to find the weight values that result in the global minimum of the error function. Note that this picture is just for visualization purposes. In real-life applications we have more than 1 weight, so the error function is a high-dimensional function.
But how do we find the minimum of this function? A simple idea here is to start with random weights, calculate the error function for those weights and then check the slope of this function to go downhill.
But how do we get to know the slope of the function?
We can use calculus here and leverage the fact that the derivative of a function at a given point is equal to the slope of the function at this point. We can write this derivative in the following way:
Where E is our error function and W represents the weights. This notation informs us that we want to find the derivative of the error function with respect to the weights. We use n+1 with the error, since in our notation the output of the neural network after the weights Wn is On+1.
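Written out in the indexing just described, this is the partial derivative of the error at layer n+1 with respect to the weights of layer n:

```latex
\frac{\partial E_{n+1}}{\partial W_n}
```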
We can then use this derivative to update the weight:
This represents the “going downhill”: in each learning iteration (epoch) we update the weights according to the slope of the error function.
There is one more thing we need before presenting the final equation, and that is the learning rate. The learning rate regulates how big the steps we take are while going downhill.
As you can see, with a bigger learning rate we take bigger steps. This means we can get to the optimum of the function quicker, but there is also a greater chance we will miss it.
With a smaller learning rate we take smaller steps, which means we need more epochs to reach the minimum of the function, but there is a smaller chance we miss it.
That’s why in practice we often use a learning rate that depends on the previous steps, e.g. if there is a strong trend of going in one direction, we can take bigger steps (larger learning rate), but if the direction keeps changing, we should take smaller steps (smaller learning rate) to search for the minimum better. In our example, however, we are going to take the simple approach and use a fixed learning rate value. This gives us the following equation.
The learning rate (Lr) is a number in the range 0–1. The smaller it is, the smaller the change to the weights. If the learning rate is close to 1, we use the full value of the derivative to update the weights, and if it is close to 0, we only use a small part of it. This means that the learning rate, as the name suggests, regulates how much the network “learns” in a single iteration.
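The whole update rule fits in one line of code. The tiny loop below is a sketch that minimizes a one-variable stand-in error function E(w) = (w - 3)², an assumption chosen purely for illustration (its derivative is 2(w - 3)), showing how the learning rate scales each downhill step:

```python
def dE_dw(w):
    # Derivative of the stand-in error function E(w) = (w - 3)**2.
    return 2.0 * (w - 3.0)

lr = 0.1   # learning rate, a value in the range 0-1
w = 0.0    # some initial weight

for epoch in range(100):
    w = w - lr * dE_dw(w)  # the weight update: step downhill, scaled by lr

print(w)  # converges towards the minimum at w = 3
```

Try changing lr to a larger value such as 0.9 (bigger but riskier steps) or a smaller one such as 0.01 (safer steps, more epochs needed) to see the trade-off described above.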
Updating the weights was the final equation we needed in our neural network. It is the equation that is responsible for the actual learning of the network and for teaching it to give meaningful output instead of random values.