- Supervised/Unsupervised: Supervised
- Regression/Classification: Regression
The goal is to fit a line to the data values as well as possible. Meaning, we calculate the “line” that produces the least total error.
Formula — (Linear Regression):
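Using the notation defined in the bullets below, the model with p−1 predictors is typically written as:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{p-1} x_{p-1} + \epsilon
```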
- X variables (inputs): These are the input values that are in the dataset. If you’re predicting employee salaries, some inputs could be age, level of education, or location.
- β0 (bias term): A bias term is needed unless we believe the model should pass through the origin. The bias term is where the line intercepts the y-axis, and that y-intercept matters. For example, if we were predicting a baby’s weight, the line should not start at 0 lbs, since a baby cannot weigh nothing (assuming it was born).
- β1 … βp−1 (coefficients): We multiply the X variables with the weights/betas (represented by β). The betas are what the model is calculating. As you’ll later see, they are how we manipulate the line to fit the dependent variable.
- The epsilon (ϵ): Represents the error term, i.e., our inability to explain the data with 100% accuracy. We can never expect our sample to be a perfect representation of the population.
The objective is to estimate the betas (weights) that “fit” the data values best.
They are the coefficients multiplied with the inputs. If there were only one predictor, the goal would be to find the straight line that best “fits” the data points.
- An evaluation method must be established: a cost function. The cost function is typically the Mean Squared Error (MSE).
Formula — MSE (Mean Squared Error):
The MSE formula measures the average squared difference between the observed values and the predictions.
- yᵢ: The ground truth (the observed value for observation i).
- W*X: Multiplying our weights by the X variables to get a prediction.
- W0: Not shown in the function, but we will use a bias term.
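Putting those terms together (with the 1/(2N) convention explained in the notes that follow), the cost function can be sketched as:

```latex
\mathrm{MSE}(W) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - (W \cdot X_i + W_0) \right)^2
```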
So the process is relatively straightforward. Let’s use an example:
- You’re tasked with predicting people’s salaries. Age is your independent variable. In practice you would have more variables than age, but let’s stick with one. You will use the training set to measure accuracy.
- If you were only to fit one person, whose salary is $100,000 and who is 50 years old, the coefficient (beta) would be 2000. Meaning, for every additional year of age, a person is expected to earn $2000 more.
- Now there are two people. The second person earns $60,000 but is only 20 years old. Hence, our weight of 2000 does not work anymore.
Our job is to find a weight that minimizes the average error across all the observations.
The task becomes much more complicated when there are many variables and observations.
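To make the “minimize the average error” idea concrete, here is a minimal sketch of the two-person example (assuming no bias term and the 1/(2N) convention described in the MSE notes):

```python
# Toy data from the example: (age, salary) pairs.
ages = [50, 20]
salaries = [100_000, 60_000]

def mse(weight, xs, ys):
    """Mean squared error with the 1/(2N) convention and no bias term."""
    n = len(xs)
    return sum((y - weight * x) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# The weight 2000 fits person one perfectly but misses person two,
# so a compromise weight gives a lower average error.
# (For one predictor and no bias, the minimizer is sum(x*y) / sum(x*x),
# which works out to roughly 2138 here.)
for w in (2000, 2200):
    print(w, mse(w, ages, salaries))
```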
Additional Notes (MSE):
- We square the difference so that every error contributes a positive value. The capital sigma, Σ, means we sum the squared differences between the predictions and the ground truths. Lastly, we divide by the total number of observations (N) to get the average.
- The reason we use “2N” instead of N is to simplify the derivatives: the 2 that comes from differentiating the square cancels out.
Our goal is to minimize the Cost Function (MSE).
“So how do we minimize the MSE?”
With gradient descent.
Formula — Gradient Descent (Simple Regression):
The diagram below illustrates the goal of gradient descent: we need to “reach” the bottom of the parabola. The slopes are drawn as the straight colored lines (green, yellow, and red). The steeper the slope, the quicker we descend, so the objective is to follow the slope that takes us from the current position to the base as efficiently as possible.
For a simple linear regression problem, we calculate the derivatives with respect to our parameters (including the bias). The equations below are the derivatives for m and b. Note that m is the same as the beta/weight in simple linear regression (the terminology is often inconsistent).
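A reconstruction of those two derivatives, assuming the 1/(2N) cost so the 2 from the square cancels:

```latex
\frac{\partial J}{\partial m} = -\frac{1}{N} \sum_{i=1}^{N} x_i \left( y_i - (m x_i + b) \right)
\qquad
\frac{\partial J}{\partial b} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)
```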
- We calculate the derivatives of the bias and weights for each observation, then sum those derivatives and divide to get the average.
- The last step is updating the betas. These derivatives calculate how we should tweak the betas.
- The formula is original_beta minus (learning_rate * derivative).
- The derivative gives the direction to move in, and the learning rate controls how fast we move. A learning rate of 0.05 is a common starting point but can be adjusted.
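The steps above can be sketched as a small gradient-descent loop (the toy data and hyperparameters here are illustrative assumptions):

```python
def gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y ≈ m*x + b by gradient descent on the 1/(2N) MSE."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Average derivatives of the cost with respect to m and b.
        dm = -sum(x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        db = -sum(y - (m * x + b) for x, y in zip(xs, ys)) / n
        # Update step: original_beta minus (learning_rate * derivative).
        m -= lr * dm
        b -= lr * db
    return m, b

# Tiny illustrative dataset where y = 2x exactly, so m should
# approach 2 and b should approach 0.
m, b = gradient_descent([1, 2, 3], [2, 4, 6])
```

Note that with real-scale data (e.g. raw ages and salaries) you would normally scale the features first, or a 0.05 learning rate can diverge.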
Formula — Gradient Descent (Multi-Linear Regression):
More realistically, you’ll be dealing with a multi-linear regression problem. Hence, the equations below describe the partial derivatives for each of the possible weights, which are associated with each independent variable. The “hθ(x)” term is our prediction.
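A reconstruction of that partial derivative for the j-th weight, with hθ(x) as the prediction:

```latex
\frac{\partial J}{\partial \theta_j} = \frac{1}{N} \sum_{i=1}^{N} \left( h_\theta(x_i) - y_i \right) x_{i,j}
```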
If you’re rock climbing, you can move forward, backward, or up and down. The partial derivatives describe how much you should move along each direction in 3-dimensional space.
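The same update extends to many weights at once. This is a sketch using NumPy (an assumption, not from the source), with the bias folded in as an extra column of ones so every parameter is updated by its own partial derivative:

```python
import numpy as np

def fit(X, y, lr=0.05, epochs=2000):
    """Multi-linear regression via gradient descent on the 1/(2N) MSE."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend a bias column
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        preds = X @ w                  # h_theta(x): predictions for all rows
        grad = X.T @ (preds - y) / n   # partial derivative for each weight
        w -= lr * grad
    return w

# Two predictors with the made-up relation y = 1 + 2*x1 + 3*x2,
# so the fitted weights should approach (1, 2, 3).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
w = fit(X, y)
```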
Is there anything else?
Yes. One of the difficulties in machine learning is a concept called overfitting: the model performs well on the training set (the dataset it learns from) but not as well on the validation or testing set (the datasets it is validated or tested with).
One technique to reduce overfitting is shrinking your weights/betas.
It’s a fascinating concept. The strongest beta/weight will still be the strongest relative to the other independent variables.
However, the magnitude will be reduced for all independent variables.
For this, we have ridge and lasso regression.
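As a sketch of what ridge regression does to the weights (the closed-form solution and the toy data here are my own illustration, not from the source):

```python
import numpy as np

def ridge(X, y, alpha):
    """Closed-form ridge solution (X^T X + alpha*I)^-1 X^T y.

    alpha = 0 reduces to ordinary least squares.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Made-up data where the true weights are (5, 2, 0.5).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + 0.5 * X[:, 2]

w_ols = ridge(X, y, alpha=0.0)    # recovers roughly (5, 2, 0.5)
w_ridge = ridge(X, y, alpha=50.0)

# The strongest weight stays the strongest, but its magnitude
# is pulled toward zero compared with plain least squares.
print("OLS:  ", w_ols)
print("Ridge:", w_ridge)
```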