The sigmoid’(z1), sigmoid’(z2), etc. terms are each at most 1/4, because the derivative of the sigmoid never exceeds 1/4: since sigmoid’(z) = sigmoid(z)(1 − sigmoid(z)), it peaks at exactly 1/4 when z = 0. See the sketch below. The weight matrices w1, w2, w3, w4 are initialized from a Gaussian with mean 0 and standard deviation 1, so the individual weights w(i) are typically around 1 or smaller in magnitude. The gradient with respect to an early weight is therefore a product of many factors, each of which tends to be less than 1 (and at most 1/4 for the sigmoid terms). Multiplying so many small terms together yields a vanishingly small gradient, which makes the model almost stop learning.
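Here is a minimal sketch of both claims. It checks numerically that sigmoid’(z) tops out at 1/4, and shows how quickly even that best-case factor collapses when raised to the number of layers (the layer counts are just illustrative values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 with value exactly 1/4.
z = np.linspace(-10, 10, 1001)
print(sigmoid_prime(z).max())  # 0.25

# Backprop multiplies one such factor per layer, so even the
# best case (1/4)^n collapses toward zero as depth n grows.
for n in (5, 10, 20, 50):
    print(n, 0.25 ** n)
```

Already at 10 layers the best-case product is below one millionth, which is why the gradient effectively disappears.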
The reason the earliest hidden layers of a deep model learn the slowest is this: during backprop, by the time the error signal reaches the first hidden layers it has passed through every later layer, so its gradient contains more of these sub-unity factors and is therefore even smaller. The toy simulation below makes this visible.
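This is a toy sketch under a simplifying assumption: a hypothetical chain of one-unit layers, where each layer contributes a single factor w · sigmoid’(z) to the gradient flowing back through it. The depth and random seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers = 30

# N(0, 1) weight init, as in the text; pre-activations also drawn N(0, 1).
w = rng.normal(0.0, 1.0, size=n_layers)
z = rng.normal(0.0, 1.0, size=n_layers)
s = 1.0 / (1.0 + np.exp(-z))
factors = w * s * (1.0 - s)  # each factor is typically well below 1

# Gradient reaching layer i = product of the factors from layer i out to the output,
# so earlier layers accumulate more shrinking factors.
grad = np.abs(np.cumprod(factors[::-1]))[::-1]
for i in range(0, n_layers, 5):
    print(f"layer {i:2d}: |gradient| ~ {grad[i]:.2e}")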
The exploding gradient problem is the mirror image: if we initialize the weight matrices with very large values, the per-layer factors exceed 1, so the product of derivatives grows exponentially with depth, and training becomes highly unstable.
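The same toy chain shows both regimes if we make the weight scale tunable; the standard deviations below are illustrative picks, and note that with sigmoid the weights must be large enough to outweigh the 1/4 cap on sigmoid’(z) before the product blows up:

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_factor_product(weight_std, n_layers=30):
    # Product of the per-layer factors w * sigmoid'(z), as in the sketch above.
    w = rng.normal(0.0, weight_std, size=n_layers)
    z = rng.normal(0.0, 1.0, size=n_layers)
    s = 1.0 / (1.0 + np.exp(-z))
    return np.abs(np.prod(w * s * (1.0 - s)))

print(backprop_factor_product(weight_std=1.0))    # tiny: gradient vanishes
print(backprop_factor_product(weight_std=100.0))  # enormous: gradient explodes
```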
Thanks for reading, guys!
Credit: BecomingHuman. By: Manik Soni