There are a few optimization algorithms for finding a local minimum in regression. Gradient descent is an iterative algorithm used to optimize learning, and its purpose is to minimize the value of the cost function. Now, let's try gradient descent to optimize the cost function with some learning rate, assuming no regularization is taken into consideration. Recall that the parameters of our model are the theta (θ) values.
The hypothesis can be described with this typical linear equation:

hθ(x) = θ1·x + θ0

The cost function is

J(θ0, θ1) = (1/2m) · Σ (hθ(x_i) − y_i)²

where m is the number of data points and the sum runs over all of them.
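Before running gradient descent, it helps to see the cost function by itself. The sketch below (the name cost_function and the toy arrays are illustrative, not from the article's earlier code) evaluates J(θ0, θ1) on a small synthetic dataset:

```python
import numpy as np

def cost_function(x, y, theta_0, theta_1):
    # J(theta_0, theta_1) = (1/2m) * sum of squared prediction errors
    m = len(x)
    predictions = (theta_1 * x) + theta_0
    return (1 / (2 * m)) * np.sum((predictions - y) ** 2)

# Toy data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

print(cost_function(x, y, theta_0=1.0, theta_1=2.0))  # perfect fit, cost 0.0
print(cost_function(x, y, theta_0=0.0, theta_1=0.0))  # cost 10.5
```

A perfect fit drives the cost to zero, while the all-zero parameters leave a large residual, which is exactly the gap gradient descent will close.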
Reminder again: the objective of linear regression is to minimize the cost function. Thus, the goal is to minimize J(θ0, θ1), and we will fit the linear regression parameters to the data using gradient descent. Two main functions are defined, derivative_J_theta() and gradient_descent(), to perform the gradient descent algorithm.
As shown in the code below, the computation follows the equations expressed mathematically above. The derivative_J_theta() function computes the updated θ0 and θ1 from the partial derivatives of the cost function. Then we measure the loss of our hypothesis using the cost function.
import numpy as np

def derivative_J_theta(x, y, theta_0, theta_1, learning_rate):
    m = len(x)
    delta_J_theta0 = 0
    delta_J_theta1 = 0
    for i in range(m):
        # Prediction error for the i-th data point
        error = ((theta_1 * x[i]) + theta_0) - y[i]
        delta_J_theta0 += error
        delta_J_theta1 += error * x[i]
    # Simultaneous update of both parameters
    temp0 = theta_0 - (learning_rate * (1/m) * delta_J_theta0)
    temp1 = theta_1 - (learning_rate * (1/m) * delta_J_theta1)
    return temp0, temp1

def gradient_descent(x, y, learning_rate, starting_theta_0, starting_theta_1, iteration_num):
    m = len(x)
    store_theta_0 = np.empty([iteration_num])
    store_theta_1 = np.empty([iteration_num])
    store_j_theta = np.empty([iteration_num])
    theta_0 = starting_theta_0
    theta_1 = starting_theta_1
    for i in range(iteration_num):
        theta_0, theta_1 = derivative_J_theta(x, y, theta_0, theta_1, learning_rate)
        store_theta_0[i] = theta_0
        store_theta_1[i] = theta_1
        # Cost J(theta_0, theta_1) after this update
        store_j_theta[i] = (1/(2*m)) * np.sum((((theta_1 * x) + theta_0) - y)**2)
    return theta_0, theta_1, store_theta_0, store_theta_1, store_j_theta
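As a cross-check on the update rule, the same computation can be written in vectorized NumPy without the inner loop. This is a minimal self-contained sketch on a hypothetical toy dataset (names such as gradient_descent_vectorized and toy_x are illustrative, not from the article):

```python
import numpy as np

def gradient_descent_vectorized(x, y, learning_rate, theta_0, theta_1, iteration_num):
    m = len(x)
    for _ in range(iteration_num):
        # Prediction errors for all m points at once, shape (m,)
        error = ((theta_1 * x) + theta_0) - y
        # Same simultaneous update as the loop version
        theta_0 = theta_0 - learning_rate * (1 / m) * np.sum(error)
        theta_1 = theta_1 - learning_rate * (1 / m) * np.sum(error * x)
    return theta_0, theta_1

# Toy data on y = 2x + 1; the parameters should approach theta_0=1, theta_1=2
toy_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
toy_y = 2.0 * toy_x + 1.0
t0, t1 = gradient_descent_vectorized(toy_x, toy_y, 0.05, 0.0, 0.0, 5000)
print(t0, t1)
```

On data generated exactly from a line, both the looped and the vectorized versions should converge to the same parameters; the vectorized form is simply faster on larger arrays.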
We can now train the model with a small iteration number first and observe the result.
x = X
learning_rate = 0.01
iteration_num = 10
starting_theta_0 = 0
starting_theta_1 = 0
theta_0, theta_1, store_theta_0, store_theta_1, store_j_theta = gradient_descent(x, y, learning_rate, starting_theta_0, starting_theta_1, iteration_num)
print("theta_0 : %f" % theta_0)
print("theta_1 : %f" % theta_1)
The θ0 and θ1 values we obtain are 3.0219 and 0.6846 respectively.
Let's plot the line to see how well the hypothesis fits our data.
plt.plot(X,(theta_1 * X) + theta_0, c='green')
plt.plot(X, X.dot(ne_theta), c='red')
The green line indicates our prediction, while the red line is the normal equation solution.
Almost there! It seems that more iterations will generate better results. Let's increase the iteration number to 100.
iteration_num = 100
The θ0 value is now 3.3572 and θ1 is 0.9560. Plot the graph again.
Great! The newly trained parameters θ0 and θ1 are optimized, and the fitted line almost aligns with the best-fit red line, which indicates our gradient descent algorithm is working well.
From this example, we can clearly understand the mathematical fundamentals behind the univariate linear regression algorithm, which can be very useful for making predictions in machine learning applications.
Additional information about the differences between Gradient Descent and the Normal Equation is summarized in the short notes.
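For completeness, here is a minimal self-contained sketch of the normal equation, θ = (XᵀX)⁻¹Xᵀy, which solves for both parameters in one step instead of iterating. The toy dataset and variable names below are hypothetical, not the article's own data:

```python
import numpy as np

# Toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Build the design matrix with a bias column of ones, shape (m, 2)
X_b = np.c_[np.ones(len(x)), x]

# Normal equation: theta = (X^T X)^{-1} X^T y
ne_theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(ne_theta)  # [theta_0, theta_1], here [1. 2.]
```

In practice np.linalg.pinv (or np.linalg.lstsq) is usually preferred over inv for numerical stability, but the direct form above matches the textbook equation.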
You are welcome to discuss this simple yet useful analysis technique further.