Starting at the top of the mountain, we take small steps downhill in the direction of the negative gradient. After each step, we recalculate the gradient and step in the new negative-gradient direction. We repeat this process until we reach the bottom, a local minimum.
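As a minimal sketch of this loop (the gradient function, starting point, and step count below are placeholders chosen only for illustration):

import numpy as np

def gradient_descent(grad_f, start, learning_rate=0.1, n_steps=100):
    # Repeatedly step opposite the gradient, i.e. downhill
    x = np.asarray(start, dtype=float)
    for _ in range(n_steps):
        x = x - learning_rate * grad_f(x)
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
print(gradient_descent(lambda x: 2 * (x - 3), start=[0.0]))  # approaches [3.]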
Learning Rate
The size of each step is called the learning rate. A high learning rate makes the descent faster, but you risk overshooting the local minimum. A low learning rate is more precise, but it is time-consuming. Finding the “right” learning rate is therefore very important.
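To make the trade-off concrete, here is a small illustration (the function f(x) = x² and the rates below are made up for demonstration):

def descend(learning_rate, steps=20, x=5.0):
    # Minimize f(x) = x**2, whose gradient is 2x, with a fixed learning rate
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

print(descend(0.01))  # too small: still far from the minimum at 0 after 20 steps
print(descend(0.1))   # reasonable: close to 0
print(descend(1.1))   # too large: every step overshoots and x blows up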
Cost Function
The cost function describes how well the model performs given the current set of parameters (weights and biases), and gradient descent is used to find the parameters that minimize it. (Clare Liu)
Math
Cost Function
$$f(m, b) = \frac{1}{N}\sum_{i=1}^{N} \bigl(y_i - (m x_i + b)\bigr)^2$$
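In Python, this cost can be computed directly (a small sketch; the toy data in the example is made up):

import numpy as np

def cost(m, b, X, Y):
    # Mean squared error of the line y = m*x + b over the data set
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    return np.mean((Y - (m * X + b)) ** 2)

print(cost(0.0, 0.0, X=[1, 2, 3], Y=[2, 4, 6]))  # large error: a flat line fits poorly
print(cost(2.0, 0.0, X=[1, 2, 3], Y=[2, 4, 6]))  # 0.0: the line y = 2x fits exactly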
Gradient Descent
$$f'(m, b) = \begin{bmatrix} \frac{\partial f}{\partial m} \\ \frac{\partial f}{\partial b} \end{bmatrix} = \begin{bmatrix} \frac{1}{N}\sum_{i=1}^{N} -2x_i\bigl(y_i - (m x_i + b)\bigr) \\ \frac{1}{N}\sum_{i=1}^{N} -2\bigl(y_i - (m x_i + b)\bigr) \end{bmatrix}$$
Code
def update_weights(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))
        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b
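As a usage sketch, update_weights can be called repeatedly to fit a line; the toy data and hyperparameters below are made up for illustration:

X = [1, 2, 3, 4]
Y = [2, 4, 6, 8]
m, b = 0.0, 0.0
for _ in range(1000):
    m, b = update_weights(m, b, X, Y, learning_rate=0.05)
print(m, b)  # m approaches 2 and b approaches 0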
References:
Credit: BecomingHuman, by Amir Boltaev