And yes, these parameters are “learned” by your algorithm. What I mean is that the machine has to figure out the values of these parameters itself.

Before discussing how these parameters are learned by our algorithm, let's make some changes to the cost function.

This cost function is the mean (average) of the squares of the error terms, divided by 2. This function also has a name. Can you guess it? It's the Mean Squared Error (M.S.E.).
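Written out as code, this cost function might look like the sketch below (the toy data is my own; the formula, the mean of squared errors divided by 2, is the one described above):

```python
# Cost for a line y = m*x + c over a small dataset:
# mean of the squared errors, divided by 2 (the M.S.E.-based cost).
def cost(m, c, xs, ys):
    n = len(xs)
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / (2 * n)

xs = [1, 2, 3]
ys = [2, 4, 6]               # toy data, perfectly described by y = 2x
print(cost(2, 0, xs, ys))    # 0.0, because m = 2, c = 0 fits the data exactly
print(cost(0, 0, xs, ys))    # a larger cost, because this line fits poorly
```

The only thing the learning algorithm will do later is search for the m and c that make this number as small as possible.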

Now let’s talk about how these parameters are learned.

Let’s plot the above equation on a graph.

figure 7
Let's understand what just happened here.

We have a quadratic function y = x² (you can also write this as f(x) = x², where x is the input), which, when drawn on a graph, gives us a parabolic shape (figure 7).

As you already know, our goal is to minimize the cost function. To learn how to minimize a function with multiple inputs (like m and c in our cost function), we will first understand how to minimize a function with one input (like our quadratic equation y = x²). Remember, whenever I say "minimize a function", you should think about what value of the input will give the lowest output.

Like for our function f(x) = x², if we put x = 0 (the input), then,

f(0) = 0² = 0

we get 0 as output.

What if we use negative numbers as input here?

Well, that won't make a difference, because our function squares whatever input you feed it. So if you feed it a negative number, squaring it will give a positive number, which is greater than zero.

f(-2) = (-2)² = 4
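You can verify this with a couple of lines (a trivial sketch of the function above):

```python
# f(x) = x**2 at a few inputs, negative ones included: every nonzero
# input squares to a positive output, so the smallest output is at x = 0.
def f(x):
    return x ** 2

for x in [-2, -1, 0, 1, 2]:
    print(x, "->", f(x))  # outputs 4, 1, 0, 1, 4: minimum at x = 0
```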

In figure 7, you can see that our function is at its minimum (meaning y = 0, or f(0) = 0) at x = 0. Well, that was easy, wasn't it? You just put in zero as input and get zero as output. Except it isn't that easy. Let me explain why.

If you set m and c to zero in our cost function (in figure 6.1), will you get zero as output? No, because then mx + c will be 0 no matter what value of x you put there. So you have not minimized the difference between y (the actual output) and mx + c (the predicted output). Pick up a pen and paper and think it through before moving forward.

But can't we just plot our function graphically and find where it is minimum?

It's more than a "yes" or "no" question.

Let me explain.

For a function like f(x) = x², where you have only one input and one output, you can represent the input on the x-axis and the output on the y-axis (like in figure 7). In other words, you can visualize it in 2D.

You can also do the same for our cost function: represent m (input) and c (input) on the x-axis and y-axis respectively, and the cost (output) on the z-axis. In other words, 3D.

Now, if you have a function with 3 inputs (and 1 output), can you tell me how many dimensions you would need to visualize it?

Yeah, it's 4D, and there is no way you can visualize that. So our visualization is limited to 3D.

Is there any other way to find the minimum of the function?

Yes, there is. That other way is gradient descent.

figure 8
See the black dot in figure 8. Assume that you are standing in that place and you want to get to the red dot (because the output, y or f(x), is minimum at the red dot). At any given instant, you have two choices to make: go up or go down, and take a big step (green arrow) or a small step (red arrow), in the direction where the output is decreasing.

I know it's somewhat confusing: how is gradient descent going to figure out which way is down and which is up? And even if it does figure that out, how is it going to decide whether it should take a bigger step or a smaller step?

Just bear with me for a few more minutes. I will start with how it decides which direction to take.

f(x) = x²

f(-4) = (-4)² = 16

(-4, 16) is the position of our black dot, where -4 is the input and 16 is the output. Don't forget that this black dot was randomly chosen. We could start anywhere.

Now I want you to bring out the calculator before moving on.

If you can find a way to know in which direction the slope is getting less steep, then you know which way to go, because a flattening slope tells us that we are going downhill. For example, the slope at the red dot is 0.

How to find the slope?

You can find the slope using the derivative. It’s okay if you don’t know calculus. Just for now, remember that you can find the slope using the derivative of the function.
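If you'd like to convince yourself that the derivative really gives the slope, you can approximate the slope numerically (a sketch; the tiny step h is just an approximation trick, not something the text introduces):

```python
# Approximate the slope of f(x) = x**2 at a point by measuring how much
# the output changes over a tiny step h (a finite-difference estimate).
def f(x):
    return x ** 2

def numeric_slope(x, h=1e-6):
    return (f(x + h) - f(x)) / h

print(numeric_slope(-4))  # close to 2 * -4 = -8
print(numeric_slope(3))   # close to 2 * 3 = 6
```

The estimates match the derivative of x², which is 2x, and that is the value gradient descent will use.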

lr = learning rate

Let's take lr = 0.1

current_input = -4 and output = 16

The derivative of x² is 2x, so the slope at the current input is 2 * current_input.

Step 1: current_input = current_input - (lr * slope) = -4 - (0.1 * 2 * -4) = -3.2

Step 2: Output at the current input = f(-3.2) = (-3.2)² = 10.24

figure 9
Repeat Steps 1 and 2 again.

current_input = -3.2 - (0.1 * 2 * -3.2) = -2.56

Output = f(-2.56) = (-2.56)² = 6.55

figure 10
Can you see what is happening here?

With each step, our black dot goes downhill. You should also notice that with each step our black dot takes smaller and smaller steps, because as you go downhill, the slope gets less steep (see figure 11). If you keep repeating these steps, you will eventually reach the red dot. Use a calculator and see how many steps it takes to reach x = 0, where the output is minimum (which is also zero in this case).

figure 11
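The calculator exercise above can be automated in a few lines (a sketch using the same lr = 0.1 and starting point x = -4; the stopping threshold is my own choice):

```python
# Repeat the update x = x - lr * slope, where the slope of x**2 is 2x,
# until the output x**2 is essentially zero.
lr = 0.1
x = -4.0
steps = 0
while x ** 2 > 1e-6:
    x = x - lr * (2 * x)
    steps += 1

print(steps, x)  # 38 steps with these settings, with x very close to 0
```

Notice that each update multiplies x by the same factor (1 - lr * 2), which is why the steps shrink as x approaches 0.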
Also, notice that the slope is negative when we are to the left of the red dot. That is something you have to understand: the negative slope is telling us the direction.

What if we were on the right of the red dot?

Then the slope would be positive, and it would again tell us to move in the direction of the red dot.

For example, if our black dot was at (4,16).

Then, slope = 2 * x = 2 * 4 = 8 (a positive slope)

Learning Rate
It basically means the size of the step that you are going to take in any particular direction.

Like if you were actually in that place (the position of the black dot), walking downhill and taking very small steps (a small learning rate), it would take you longer to reach the minimum point (figure 13). And if you were taking very long steps (a big learning rate), then you might keep skipping over the minimum point and overshoot (figure 12).

figure 12
figure 13
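You can see both effects in a small experiment (a sketch on f(x) = x²; the three learning rates are my own picks to illustrate the extremes):

```python
# Gradient descent on f(x) = x**2 (slope 2x) starting from x = -4,
# tried with a tiny, a large, and a too-large learning rate.
def descend(lr, x=-4.0, steps=5):
    trace = [x]
    for _ in range(steps):
        x = x - lr * (2 * x)  # one gradient descent step
        trace.append(x)
    return trace

print(descend(0.01))  # tiny steps: after 5 steps we have barely moved toward 0
print(descend(0.9))   # big steps: jumps across 0 each time, but still shrinks
print(descend(1.1))   # too big: overshoots more and more, and diverges
```

So the learning rate has to be small enough not to overshoot, but not so small that reaching the minimum takes forever.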
The whole point of doing this graphically is to digest the idea of how derivatives help us to find the direction.

figure 14
When you plot the cost function (figure 6), it will look like the plot above (figure 14), where the horizontal axes represent the inputs (m and c) and the vertical axis represents the output (cost). You will always get a bowl-shaped surface for our cost function. This type of function is known as a convex function.

When you start at any random point on the curve, you will reach the minimum point on the curve (the minimum value of the cost), because a convex function has only one minimum. And as you already know, we can use derivatives to find the downhill direction (the steepest direction going downhill).

One thing you should note here: when you have more than one input/parameter, you use partial derivatives instead of ordinary derivatives.

Now let’s formulate our gradient descent approach for our cost function.

figure 15
What we will do is keep repeating the code inside the curly brackets until we find the minimum of the function.

But how can I find the partial derivative of the cost function?

Well, if you know how to find a derivative, then it won't be difficult to find the partial derivative of a function: you simply treat every other input as a constant and differentiate with respect to the one you care about.

figure 16
Now let’s rewrite the equations in figure 15 by putting the value of partial derivatives from figure 16.
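Putting everything together, the repeated update from figure 15, with the partial derivatives plugged in, can be sketched like this (the toy data, learning rate, and iteration count are my own choices; I assume the cost is the mean of squared errors divided by 2, as defined earlier):

```python
# Gradient descent for simple linear regression y = m*x + c.
# The partial derivatives below come from differentiating the cost
# (1/2n) * sum((m*x + c - y)**2) with respect to m and c.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # toy data generated by y = 2x + 1
m, c = 0.0, 0.0            # start anywhere; zero is a common choice
lr = 0.05
n = len(xs)

for _ in range(5000):
    dm = sum((m * x + c - y) * x for x, y in zip(xs, ys)) / n  # d(cost)/dm
    dc = sum((m * x + c - y) for x, y in zip(xs, ys)) / n      # d(cost)/dc
    m, c = m - lr * dm, c - lr * dc  # update both parameters simultaneously

print(m, c)  # approaches m = 2, c = 1, the line that generated the data
```

Note that m and c are updated simultaneously, each using the values from the previous step; updating one before computing the other's derivative would be a subtly different algorithm.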

In this chapter, we learned about simple linear regression. We also learned how to minimize a function using gradient descent.

In the next chapter, we will learn about —

Multiple Linear Regression
Normal Equation
Implementation of Linear Regression in Python