Let’s start by grabbing the MNIST dataset. Since we do this a lot, we will define a function to do so.
Let’s now calculate the mean and standard deviation of our data.
We want our mean to be 0 and standard deviation to be 1 (more on this later). Hence we normalize our data by subtracting the mean and dividing by the standard deviation.
Notice that we normalize the validation set with
train_mean (and train_std), not the validation set's own statistics, to keep the training and the validation sets on the same scale.
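In plain Python the normalization step looks something like this. This is a minimal sketch with stand-in data; the names x_train, x_valid, train_mean, and train_std are my assumptions, not the article's exact code:

```python
import random
import statistics

random.seed(0)

# Stand-in data; in the article this would be the flattened MNIST pixels.
x_train = [random.gauss(5.0, 3.0) for _ in range(10_000)]
x_valid = [random.gauss(5.0, 3.0) for _ in range(2_000)]

train_mean = statistics.fmean(x_train)
train_std = statistics.pstdev(x_train)

def normalize(x, mean, std):
    """Shift to mean 0 and scale to standard deviation 1."""
    return [(v - mean) / std for v in x]

x_train = normalize(x_train, train_mean, train_std)
# The validation set is normalized with the *training* statistics,
# so both sets live on the same scale.
x_valid = normalize(x_valid, train_mean, train_std)
```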
Since the mean will never be exactly 0 (nor the standard deviation exactly 1), we also define a function to test whether they are close enough (within some threshold).
Next, let’s initialize our neural network.
Problems with initialization
Initializing neural networks is an important part of deep learning. It is at the heart of why we can make our neural networks as deep as they are today. Initialization determines whether we converge well and how fast.
We want to initialize our weights in such a way that the mean and variance are preserved as we pass through various layers. This does not happen with our current initialization.
We can see that after just one layer, the values of our activations (the outputs of the layer) are already far off. If we repeat this process over a number of layers, it leads to exploding activations and gradients, as shown below.
The activations of our models grow so far beyond reasonable values that they reach infinity. And it doesn’t even take 100 multiplications for this to happen.
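Here is a minimal sketch of the explosion in plain Python; the layer width of 32 and depth of 100 are illustrative choices, not the article's exact setup:

```python
import random

random.seed(42)
n = 32  # hypothetical layer width

def matvec(W, x):
    # One linear layer (no bias): W @ x
    return [sum(w * v for w, v in zip(row, x)) for row in W]

x = [random.gauss(0, 1) for _ in range(n)]
for _ in range(100):
    # Naive init: weights drawn from a unit Gaussian, no scaling.
    W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
    x = matvec(W, x)

# Each layer multiplies the standard deviation by roughly sqrt(n),
# so after 100 layers the activations are astronomically large.
print(max(abs(v) for v in x))
```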
So how do we deal with this? Maybe we can scale them down by a factor to keep them from exploding.
That doesn’t work either. While the idea was right, choosing the wrong factor leads to vanishing gradients (values shrinking towards 0).
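The same sketch with an arbitrary scaling factor of 0.01 shows the opposite failure; again, the width, depth, and factor are illustrative choices:

```python
import random

random.seed(42)
n = 32          # hypothetical layer width
scale = 0.01    # an arbitrary "make the weights smaller" factor

def matvec(W, x):
    # One linear layer (no bias): W @ x
    return [sum(w * v for w, v in zip(row, x)) for row in W]

x = [random.gauss(0, 1) for _ in range(n)]
for _ in range(100):
    W = [[random.gauss(0, 1) * scale for _ in range(n)] for _ in range(n)]
    x = matvec(W, x)

# Each layer now multiplies the standard deviation by roughly
# 0.01 * sqrt(n), which is well below 1, so the activations
# collapse towards 0 instead of exploding.
print(max(abs(v) for v in x))
```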
Choosing the right scaling factor — Xavier init
What should the value of the scaling factor be?
The answer is 1/√n, where n is the number of inputs to the layer. This initialization technique is known as Xavier initialization. If you want to learn about the math behind it, you can read the original paper or one of the reference articles listed at the end of this article. One good tip when it comes to reading research papers is to search for articles that summarize them.
And dividing by √n does work. Note that if we wanted to preserve the gradients in the backward pass instead, we would divide by the square root of the number of outputs.
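The same sketch with Xavier scaling: each weight is divided by √n, where n is the layer's input count (width and depth are again illustrative):

```python
import math
import random

random.seed(42)
n = 64  # hypothetical layer width

def matvec(W, x):
    # One linear layer (no bias): W @ x
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def std(x):
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

x = [random.gauss(0, 1) for _ in range(n)]
for _ in range(100):
    # Xavier: scale the weights by 1 / sqrt(n_inputs).
    W = [[random.gauss(0, 1) / math.sqrt(n) for _ in range(n)] for _ in range(n)]
    x = matvec(W, x)

# The standard deviation stays within a few orders of magnitude of 1,
# instead of exploding to 1e75 or vanishing to 1e-125.
print(std(x))
```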
The Xavier paper also provides a number of good visualizations as shown below.
Problem with Xavier init
The Xavier paper assumes that our activation functions are going to be linear (which they are not). Hence it ignores the effect of our activation functions on the mean and variance. Let’s think about ReLU.
A ReLU takes all the negative values in our distribution and turns them into 0s. That certainly does not preserve the mean and variance of our data. If anything, it roughly halves them. And since this happens at every layer, those factors of ½ compound.
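You can verify this with a quick simulation (the sample size is arbitrary):

```python
import random
import statistics

random.seed(0)

def relu(v):
    return max(v, 0.0)

# Draw samples from a unit Gaussian, then pass them through a ReLU.
x = [random.gauss(0, 1) for _ in range(100_000)]
y = [relu(v) for v in x]

print(statistics.fmean(x), statistics.pstdev(x))  # roughly 0 and 1
print(statistics.fmean(y), statistics.pstdev(y))  # roughly 0.4 and 0.58
```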
The solution to this problem was suggested in a paper called Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
The simple idea is that, since ReLU halves our variance at every layer, we add an extra factor of 2 in the numerator to cancel it out, scaling the weights by √(2/n) instead of 1/√n. This initialization technique is known as Kaiming initialization (or He initialization).
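A sketch of Kaiming initialization in the same style, with a ReLU after every layer (width and depth again illustrative):

```python
import math
import random

random.seed(42)
n = 64  # hypothetical layer width

def matvec(W, x):
    # One linear layer (no bias): W @ x
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def std(x):
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

x = [random.gauss(0, 1) for _ in range(n)]
for _ in range(100):
    # Kaiming: scale by sqrt(2 / n_inputs); the extra 2 compensates
    # for the ReLU zeroing out half the distribution.
    W = [[random.gauss(0, 1) * math.sqrt(2 / n) for _ in range(n)]
         for _ in range(n)]
    x = [max(v, 0.0) for v in matvec(W, x)]

# Even with a ReLU at every layer, the standard deviation stays in a
# reasonable range instead of collapsing towards 0.
print(std(x))
```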
Even though our mean is still not ideal, it certainly fixes our standard deviation. And it is amazing what good initialization can do. There is a paper called Fixup initialization where the authors trained a 10,000-layer neural network without any normalization, just by careful initialization. That should be enough to convince you that initializing neural networks well is important.
If you liked this article give it at least 50 claps :p
If you want to learn more about deep learning, check out my series of articles on the topic, as well as the references below:
- Deep learning from the foundations, fast.ai
- Understanding Xavier initialization in deep neural networks
- How to initialize deep neural networks?
- Notes on weight initialization for deep neural networks
- Variance of product of multiple random variables