Topics covered in this chapter are:
- Simple Linear Regression
- Cost Function
- Gradient Descent
Let’s plot this data on a graph.
The above figure shows the height and weight of 10 aliens. Now, can you predict the height of an alien whose weight is 120 kg? Well, by looking at the figure above, you can tell that the height of the alien is 240 cm. But how did you figure that out? How could you predict the height of an alien?
Well, let’s figure it out. In the above figure, the height of an alien having weight 20 kg is 40 cm, the height of an alien having weight 30 kg is 60 cm, and so on.
So, if you join all the blue dots, what you’ll have is a line, and like any line, it can be represented with an equation. That is what we call a linear equation.
y = mx + c
The equation I have written above is the general equation of a straight line.
Well, let’s assume that you did not go to high school. So let me explain how this general equation can represent a straight line.
We will use figure 2 for the explanation.
y = output/target variable (height of the alien)
x = input variable (weight of the alien)
m = coefficient of x
c = intercept
We need to come up with the values of m and c. So, if we put the value of x(weight) in the equation, we’ll get the value of y(height).
By looking at figure 2, we can see that it’s a simple trend and we can come up with those values very easily. You can see that for every value of x, y is twice that value.
So m = 2 and c = 0.
Our equation becomes…
y = 2 * x + 0
height = 2 * weight + 0
Let’s check if our equation is correct. If weight = 30kg
height = 2 * weight + 0 = 2 * 30 = 60cm
And 60 is correct as you can verify it by looking at figure 1.
Now, you can perfectly predict the height of an alien with 100% precision.
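This rule can be sketched in a few lines of Python (a minimal sketch; the function name predict_height is my own):

```python
# Slope and intercept read off figure 2: height is always twice the weight.
m, c = 2, 0

def predict_height(weight):
    """Apply height = m * weight + c (weight in kg, height in cm)."""
    return m * weight + c

print(predict_height(30))   # 60, matching figure 1
print(predict_height(120))  # 240, the alien we wanted to predict
```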
That was not difficult at all. I know!
Well, we ain’t aliens. Are we? (Except Elon Musk). Let’s dive into our world and see if we can still come up with an equation as we did before.
Don’t panic! I know that there is no way you can fit a straight line here as you did earlier. Those blue dots are scattered around. I know that in the alien world things were easy. But humans are complicated. Aren’t they(we)?
So what should we do now?
Well, let’s start by fitting any straight line in figure 3.
Let’s understand what is going on in figure 4. The straight line drawn in figure 4 can be represented with an equation in the form of
y = mx + c
height = m * weight + c
But the problem is that if you start putting values of x into the equation, you won’t get accurate values of y (assuming that we have found the values of m and c).
How inaccurate will those values be?
In figure 4, you will see that for the input x = 56 (weight), our straight line (drawn in figure 4) outputs 152 (height), but the actual height is 178. So the error is the vertical distance from the blue dot at (56, 178) to the black dot at (56, 152), which can be calculated using the Euclidean distance formula.
error² = (x2-x1)² + (y2-y1)²
Let’s find the error for the coordinates we discussed above.
error² = (56 - 56)² + (178 - 152)² = 0 + 26² = 676
The difference between x2 and x1 will always be zero because the shift in coordinates is only vertical, not horizontal; x2 and x1 are basically the same input. But y2 and y1 are different outputs: y2 is the actual height and y1 is the predicted height. So you can simplify the above error formula to
error² = (y2 - y1)²
error² = (actual height - predicted height)²
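Here is the same calculation in Python, showing that the full Euclidean form and the simplified vertical form agree (the variable names are mine):

```python
# Point from figure 4: the line predicts (56, 152), the actual blue dot is (56, 178).
x1, y1 = 56, 152  # predicted point on the line
x2, y2 = 56, 178  # actual data point

# Full Euclidean form...
error_sq_full = (x2 - x1) ** 2 + (y2 - y1) ** 2

# ...reduces to just the vertical difference, since x2 - x1 is always 0.
error_sq = (y2 - y1) ** 2

print(error_sq_full, error_sq)  # 676 676
```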
When you find this error for all the blue dots in figure 4 and add them up, what you get is the total error on your training set. That total error is known as the cost of the model.
That weird sign at the start is known as sigma, which basically means summing over a range of values. Here, the range runs from 1 to m, where m is the total number of training examples in your data set (not to be confused with the slope m). The superscript (i) means the i-th training example. We have 10 training examples in table 1. In reality, we have hundreds or thousands of training examples.
Now, if you can find a straight line for which the cost (in figure 5) is minimum, then you have found the line that best fits the data. The cost is minimum only when the difference between the actual and predicted values is low.
Now our main goal is to minimize the cost function. You can rewrite the cost function as
In figure 6, you can see that to minimize the cost function, you have to minimize the difference between the actual and predicted values. You can change the values of m and c such that this difference is low.
Why m and c?
Because x is the input and y is the output, x and y cannot be changed. The values of m and c, however, can be tweaked.
And yes, these parameters are “learned” by your algorithm. What I mean is that the machine has to figure out the values of these parameters itself.
Before discussing how these parameters are learned by our algorithm, let’s make some changes to the cost function.
This cost function is the mean (average) of the squares of the error terms, divided by 2. This function also has a name. Can you guess? Mean Squared Error (MSE).
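The cost function can be sketched in Python like this (a sketch, not definitive code; I use n for the number of training examples to avoid clashing with the slope m):

```python
def cost(m, c, xs, ys):
    """Mean squared error divided by 2: (1 / (2 * n)) * sum of squared errors."""
    n = len(xs)
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# On the alien data, the line y = 2x fits perfectly, so the cost is zero.
alien_weights = [20, 30, 40]
alien_heights = [40, 60, 80]
print(cost(2, 0, alien_weights, alien_heights))  # 0.0
print(cost(1, 0, alien_weights, alien_heights))  # a worse line gives a bigger cost
```

A perfect fit gives a cost of exactly zero; any line that misses the dots gives a positive cost, which is what we will try to shrink.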
Now let’s talk about how these parameters are learned.
Let’s plot the above equation on a graph.
Let’s understand what just happened here.
We have a quadratic function y = x² (you can also write this as f(x) = x², where x is the input) which, when drawn on a graph, gives us a parabolic shape (figure 7).
As you already know, our goal is to minimize the cost function. To learn how to minimize a function with multiple inputs (like m and c in our cost function), we will first understand how to minimize a function with one input (like our quadratic equation y = x²). Remember, whenever I say “minimize a function”, you should think about what value of the input will give the lowest output.
Like for our function f(x) = x², if we put x = 0(input), then,
f(0) = 0² = 0
we get 0 as output.
What if we use negative numbers as input here?
Well, that won’t make a difference because our function is squaring whatever input you feed it. So if you feed it a negative number then squaring it will give a positive number which is greater than zero.
f(-2) = (-2)² = 4
In figure 7, you can see that our function is minimum (meaning y = 0, or f(0) = 0) at x = 0. Well, that was easy, wasn’t it? You just put zero in as input and get zero as output. Well, it isn’t that easy. Let me explain why.
If you put m and c equal to zero in our cost function (in figure 6.1), will you get zero as output? No, because then mx + c will be 0 no matter what value of x you put there. So you have not minimized the difference between y (actual output) and mx + c (predicted output). Pick up a pen and paper and think it through before moving forward.
But can’t we just plot our function graphically and find where it is minimum?
It’s more than a “yes” or “no” question.
Let me explain.
For a function like f(x) = x², where you have only one input and one output, you can represent the input on the x-axis and the output on the y-axis (like in figure 7). In other words, you can visualize it in 2D.
You can also do the same for our cost function: representing m (input) and c (input) on the x-axis and y-axis respectively, and cost (output) on the z-axis. In other words, 3D.
Now, if you have a function with 3 inputs (and 1 output), can you tell me the dimension you are going to need to visualize it?
Yeah, it’s 4D and there is no way you can visualize that. So our visualization is limited to only 3D.
Is there any other way to find the minimum of the function?
Yes, there is. That other way is gradient descent.
See the black dot in figure 8. Assume that you are standing in that place and you want to reach the red dot (because the output, y or f(x), is minimum at the red dot). At any given instant, you have two decisions to make: which direction to move in (the one where your output is decreasing), and whether to take a big step (green arrow) or a small step (red arrow).
I know it’s somewhat confusing. How is gradient descent going to figure out which way is down and which way is up? And even if it does figure that out, how is it going to decide whether it should take a bigger step or a smaller step?
Just bear with me for a few more minutes. I will start with how it decides which direction to take.
f(x) = x²
f(-4) = (-4)² = 16
(-4, 16) is the position of our black dot, where -4 is the input and 16 is the output. Don’t forget that this black dot is randomly chosen; we can start anywhere.
Now I want you to bring out the calculator before moving on.
If you can find a way to know in which direction the slope is getting less steep, then you know which direction you have to go, because a flattening slope tells us that we are going downhill. For example, the slope at the red dot is 0.
How to find the slope?
You can find the slope using the derivative. It’s okay if you don’t know calculus. Just for now, remember that the derivative of a function gives you its slope. For our function f(x) = x², the derivative is 2x, so the slope at any input x is 2 * x.
lr = learning rate
Let’s take lr = 0.1
current_input = -4 and output = 16
Step 1: current_input = -4 - (0.1 * 2 * -4) = -3.2
Step 2: output at current_input = f(-3.2) = (-3.2)² = 10.24
Repeat Step 1 and 2 again.
current_input = -3.2 - (0.1 * 2 * -3.2) = -2.56
Output = f(-2.56) = (-2.56)² ≈ 6.55
Can you see what is happening here?
With each step, our black dot is going downhill. You should also notice that with each step our black dot takes smaller and smaller steps, because as you go downhill, the slope gets less steep (see figure 11). If you keep repeating these steps, you will eventually reach our red dot. Use a calculator and see how many steps it takes before it reaches x = 0, where the output is minimum (which is also zero in this case).
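Instead of a calculator, the two steps above can be turned into a loop. This is a minimal sketch, with the same lr = 0.1 and the same starting input x = -4:

```python
lr = 0.1       # learning rate
x = -4.0       # starting input (the black dot)

history = []
while x ** 2 > 1e-6:          # repeat until the output is effectively zero
    x = x - lr * (2 * x)      # update rule: input minus lr times the slope 2x
    history.append(x)

print(history[:2])            # first two steps, approximately -3.2 and -2.56
print(len(history))           # how many steps it took to get (nearly) to x = 0
```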
Also, notice that our slope is negative when we are on the left of the red dot. That is something you have to understand: the negative slope is telling us the direction to move in.
What if we were on the right of the red dot?
Then the slope would be positive, and it would again tell us to move in the direction of the red dot.
For example, if our black dot was at (4,16).
Then, slope = 2 * x = 2 * 4 = 8 (positive slope)
The learning rate basically means the size of the step that you are going to take in any particular direction.
If you were actually in that place (the position of the black dot), walking downhill and taking very small steps (a small learning rate), it would take you longer to reach the minimum point (figure 13). And if you were taking very long steps (a big learning rate), then you might keep skipping past the minimum point and overshoot (figure 12).
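To see the effect of the learning rate, here is a small sketch comparing a tiny, a reasonable, and an overly large lr on f(x) = x² (the specific lr values are my own picks for illustration):

```python
def descend(lr, x=-4.0, steps=20):
    """Run 20 gradient descent steps on f(x) = x**2 and return the final x."""
    for _ in range(steps):
        x = x - lr * (2 * x)   # update: x minus lr times the slope
    return x

print(descend(0.01))  # tiny steps: still far from the minimum at 0
print(descend(0.1))   # reasonable steps: close to 0
print(descend(1.5))   # huge steps: overshoots 0 every time and diverges
```

With lr = 1.5, each update jumps past the minimum to a point farther away than where it started, so the black dot gets worse with every step.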
The whole point of doing this graphically is to digest the idea of how derivatives help us to find the direction.
When you plot the cost function (figure 6), it will look like the plot above (figure 14), where the horizontal axes represent the inputs (m and c) and the vertical axis represents the output (cost). You will always get a bowl-shaped surface for our cost function. This type of function is known as a convex function.
When you start at any random point on the surface, you will always end up reaching the minimum point (the minimum value of the cost), because a convex function has a single minimum. And as you already know, we can use derivatives to find the downhill direction (the steepest direction going downhill).
There is one thing that you should note here: when you have more than one input/parameter, you use partial derivatives instead of ordinary derivatives.
Now let’s formulate our gradient descent approach for our cost function.
What we will do is keep repeating the code inside the curly brackets until we find the minimum of the function.
But how can I find the partial derivatives of the cost function?
Well, if you know how to find a derivative, then it won’t be difficult for you to find the partial derivatives of a function.
Now let’s rewrite the equations in figure 15 by putting the value of partial derivatives from figure 16.
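Putting it all together, gradient descent for our cost function can be sketched like this. The data points below are made-up stand-ins for the human height/weight scatter in figure 3, and the partial derivatives are the standard ones for the mean squared error cost:

```python
# Made-up human data: weights in kg, heights in cm (stand-ins for figure 3).
xs = [45, 56, 60, 68, 72, 80]
ys = [150, 178, 160, 170, 175, 180]

m, c = 0.0, 0.0   # start from arbitrary parameter values
lr = 0.0001       # small learning rate, since the inputs are large
n = len(xs)

for _ in range(100_000):
    # Partial derivatives of the cost (1 / (2n)) * sum((m*x + c - y)**2)
    dm = sum((m * x + c - y) * x for x, y in zip(xs, ys)) / n
    dc = sum((m * x + c - y) for x, y in zip(xs, ys)) / n
    # Update both parameters simultaneously
    m, c = m - lr * dm, c - lr * dc

print(m, c)  # slope and intercept after descent
```

Each iteration nudges m and c against their partial derivatives, so the cost keeps shrinking until the line settles into the scatter.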
In this chapter, we learned about simple linear regression. We also learned how to minimize a function using gradient descent.
In the next chapter, we will learn about:
- Multiple Linear Regression
- Normal Equation
- Implementation of Linear Regression in Python