We know that we can normalize our inputs to make training easier, but wouldn’t it be even better if we could normalize the inputs going into a particular layer, or every layer for that matter? If the inputs going into each layer were normalized, training the model would become much easier.
And to implement this, we use Batch Normalization. We implement it as follows:
Znorm(i) = (Z(i) − μ) / sqrt(σ² + ε)
Where μ and σ² are the mean and variance of Z over the mini-batch, and ε is a small constant added for numerical stability.
But we don’t want all the hidden units to always have mean = 0 and variance = 1. Hence, we change the implementation as follows:
Z´(i) = γ * Znorm(i) + β
Where γ (“Gamma”) and β (“Beta”) are learnable parameters.
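These two steps can be sketched in NumPy as follows (the function name `batch_norm_forward` and the value of ε are my own choices for illustration):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z over the mini-batch axis, then scale and shift."""
    mu = Z.mean(axis=0)                       # per-unit mean over the mini-batch
    var = Z.var(axis=0)                       # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta           # learnable scale (gamma) and shift (beta)
    return Z_tilde, Z_norm, mu, var

# A mini-batch of 4 examples with 3 hidden units
Z = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])
Z_tilde, Z_norm, mu, var = batch_norm_forward(Z, gamma=np.ones(3), beta=np.zeros(3))
```

With γ = 1 and β = 0, the output is simply the normalized values; training adjusts γ and β away from that.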
Special Case (Inverting case):
- γ = sqrt(σ² + ε)
- β = μ
With these values we get Z´(i) = Z(i), which means we invert the effect of Batch Normalization!
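A quick numerical check of this special case (the example values are arbitrary): plugging γ = sqrt(σ² + ε) and β = μ into the scale-and-shift step recovers the original Z exactly.

```python
import numpy as np

eps = 1e-8
Z = np.array([[1.0, 5.0],
              [3.0, 7.0],
              [5.0, 9.0]])                    # mini-batch of 3 examples, 2 units
mu, var = Z.mean(axis=0), Z.var(axis=0)
Z_norm = (Z - mu) / np.sqrt(var + eps)

# The inverting case: gamma undoes the division, beta undoes the subtraction
gamma = np.sqrt(var + eps)
beta = mu
Z_tilde = gamma * Z_norm + beta               # recovers the original Z
```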
This algorithm does not work with all models but when it does, it rocks!
Let’s talk about a single neuron, shall we? We know that we compute
Wx + b for a single neuron, pass it through an activation function, and then this cycle carries on; this whole process is what Feed Forward means.
Now when we talk about Batch Normalization being applied, we mean that the value of
Wx + b is normalized before being passed to the activation function.
Hence, instead of using the raw values of Z, we use the normalized (and then scaled and shifted) values to carry out our Feed Forward.
But usually, Batch Normalization is applied to Mini-Batches.
Because Batch Normalization zeros out the mean of a layer’s Z values, the effect of adding the bias b to Z is canceled out entirely. Hence, we can get rid of it; the β parameter takes over its role.
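This is easy to verify numerically (the array shapes below are arbitrary): adding a constant bias shifts the mini-batch mean by the same amount, so it vanishes after mean subtraction.

```python
import numpy as np

def normalize(Z, eps=1e-8):
    """The mean-subtract / variance-divide step of Batch Norm."""
    return (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))       # pre-activations for a mini-batch of 8
b = rng.normal(size=(4,))         # a per-unit bias

# The bias shifts the mean by exactly b, so normalization cancels it:
same = np.allclose(normalize(Z), normalize(Z + b))
```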
Hence, the overall implementation is as follows:
Z[l] = W[l] * a[l-1]
Znorm[l] is then calculated from Z[l].
Z´[l] = γ[l] * Znorm[l] + β[l]
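The three steps above can be sketched as one layer’s forward pass in NumPy (the ReLU activation, the layer sizes, and the name `bn_layer_forward` are my own assumptions for illustration):

```python
import numpy as np

def bn_layer_forward(a_prev, W, gamma, beta, eps=1e-8):
    """One hidden layer with Batch Norm: linear step without bias,
    normalize over the mini-batch, scale/shift, then activate."""
    Z = a_prev @ W.T                        # Z[l] = W[l] * a[l-1], bias dropped
    mu, var = Z.mean(axis=0), Z.var(axis=0)
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # Znorm[l]
    Z_tilde = gamma * Z_norm + beta         # Z´[l] = γ[l] * Znorm[l] + β[l]
    return np.maximum(0.0, Z_tilde)         # ReLU activation (an assumption)

rng = np.random.default_rng(1)
a_prev = rng.normal(size=(16, 5))   # 16 examples, 5 units in layer l-1
W = rng.normal(size=(3, 5))         # 3 units in layer l
a = bn_layer_forward(a_prev, W, gamma=np.ones(3), beta=np.zeros(3))
```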
We know that a model performs better if the data it is given lies in a specified small range, since it’s generally easier to pick up patterns from it. It’s easier to learn when the data is spread between 0 and 1 instead of between 0 and 10,000.
But this is just a surface-level reason; let’s dig deeper to understand why it works.
One thought is that it makes the deeper layers of your model more robust to changes in the earlier layers.
This relates to a phenomenon in Deep Learning called covariate shift: a model is trained to map some distribution of x to y, and if the distribution of x later changes, you’ll have to retrain the model.
For example, if you build a black cat classifier and test it on colored cat images, it won’t work that well, and you’ll have to retrain the model on the new dataset.
Batch Normalization limits the extent to which updating the parameters of the early layers can affect the distribution of values that the later layers see. So, in a way, Batch Norm reduces the problem of the input values changing: it makes them more stable, making the model generalize better.
It also makes each layer learn more independently, so the effect of changing the inputs is smaller.
Batch Normalization also behaves as a Regularizer:
- Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
- This adds some noise to the values within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer’s activations.
This happens because the added noise forces the network not to rely too much on any single hidden unit. Also, by increasing the mini-batch size, you reduce the noise in the statistics, which, in turn, results in a smaller regularization effect from Batch Normalization.
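A quick numerical sketch of that last point, using a synthetic population (all names and sizes here are illustrative): the per-mini-batch mean is noticeably noisier for small batches than for large ones.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=5.0, scale=2.0, size=100_000)

def mean_noise(batch_size, n_batches=1_000):
    """Std-dev of the per-mini-batch mean: the 'noise' BN injects."""
    batches = rng.choice(population, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

small = mean_noise(8)     # small mini-batches -> noisy statistics
large = mean_noise(512)   # large mini-batches -> stable statistics
```

The theory predicts the noise shrinks like 1/sqrt(batch size), which is why bigger mini-batches weaken the regularization effect.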
The regularization effect, however, is fairly small, and hence people tend to use dropout together with Batch Normalization for a better model.
Don’t use Batch Norm as a Regularizer!
Batch Norm is trained using mini-batches of data, but when we want to test or predict, we often process a single example, and hence we need to make some changes.
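One standard way to handle this (a sketch, not the only method) is to keep exponentially weighted averages of the mini-batch means and variances during training and use those fixed statistics at test time. The function names and the momentum value below are my own choices:

```python
import numpy as np

def update_running_stats(mu_run, var_run, Z_batch, momentum=0.9):
    """Exponentially weighted averages of the mini-batch statistics,
    maintained during training for later use at test time."""
    mu = Z_batch.mean(axis=0)
    var = Z_batch.var(axis=0)
    mu_run = momentum * mu_run + (1 - momentum) * mu
    var_run = momentum * var_run + (1 - momentum) * var
    return mu_run, var_run

def bn_inference(z, mu_run, var_run, gamma, beta, eps=1e-8):
    """At test time a single example is normalized with the running
    statistics instead of (nonexistent) mini-batch statistics."""
    z_norm = (z - mu_run) / np.sqrt(var_run + eps)
    return gamma * z_norm + beta

# Simulate training: accumulate statistics over many mini-batches
rng = np.random.default_rng(7)
mu_run, var_run = np.zeros(3), np.ones(3)
for _ in range(200):
    batch = rng.normal(loc=2.0, scale=3.0, size=(32, 3))
    mu_run, var_run = update_running_stats(mu_run, var_run, batch)

# A single test example, normalized with the running averages
z_hat = bn_inference(np.array([2.0, 2.0, 2.0]), mu_run, var_run,
                     gamma=np.ones(3), beta=np.zeros(3))
```

Since the test example sits exactly at the population mean, its normalized value comes out close to zero.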