How do we correctly group these students?
First, you must understand that the objective is to reduce the variation for the dependent variable.
The variance measures how spread out the possible savings amounts are. People could save anywhere from as little as $100 to as much as $100,000. Our goal is to reduce the variance by grouping similar people together and averaging within each group.
If the variation is large, predicting someone’s savings is harder, because the possible values are spread over a wider range.
Well, how do we reduce the variation?
We use the independent variables to create splits, or branches. Your objective is to look for the categorization (branches) that reduces the total variation the most. Think about it. If you’re grouping people, wouldn’t you want the people in each group to be similar? If so, you understand the Decision Tree Regressor.
Continuous independent variables are first broken down into categories (ranges of values), so they can be split in the same way as discrete ones.
We must calculate the standard deviation (the square root of the variance) of the dependent variable. Then, we must find the split that best reduces that total standard deviation.
We have ten people who provided the amount they saved:
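As a minimal sketch of that first step, the snippet below computes the variance and standard deviation of a small list of savings amounts. The five values are hypothetical and chosen only for illustration, and the population formulas (`pvariance`, `pstdev`) are an assumption about which variant is intended.

```python
import statistics

# Hypothetical savings amounts in dollars (illustrative only).
savings = [100, 2_500, 10_000, 40_000, 100_000]

mean_savings = statistics.mean(savings)
variance = statistics.pvariance(savings)  # population variance
std_dev = statistics.pstdev(savings)      # square root of the variance

print(f"mean               = {mean_savings:,.0f}")
print(f"variance           = {variance:,.0f}")
print(f"standard deviation = {std_dev:,.0f}")
```

The standard deviation is in the same unit as the savings themselves (dollars), which makes it easier to read than the variance.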
Overall metrics (aggregate):
We compute the same metrics as above, but this time separately for each category of a discrete variable. For example, we compute them for gender (male and female) below:
We must then calculate the weighted average of the categories’ standard deviations, weighting each category by its share of the people:
You repeat the step above for all the independent variables. The variable with the greatest reduction in standard deviation gives the best split (branch).
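The comparison across variables can be sketched as follows. The ten records, the `employed` column, and the helper names `weighted_std` and `std_reduction` are all hypothetical; only gender comes from the text above. Population standard deviation is assumed.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (gender, employed, savings in dollars).
people = [
    ("male", "yes", 40_000), ("male", "no", 2_000),
    ("female", "yes", 42_000), ("female", "no", 1_500),
    ("male", "yes", 38_000), ("female", "no", 2_500),
    ("male", "no", 1_000), ("female", "yes", 45_000),
    ("male", "yes", 41_000), ("female", "no", 3_000),
]

def weighted_std(rows, col):
    """Weighted average of the per-category standard deviations of savings."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    n = len(rows)
    return sum(len(g) / n * statistics.pstdev(g) for g in groups.values())

def std_reduction(rows, col):
    """Overall standard deviation minus the weighted one after splitting on col."""
    total_std = statistics.pstdev([row[-1] for row in rows])
    return total_std - weighted_std(rows, col)

columns = {"gender": 0, "employed": 1}
reductions = {name: std_reduction(people, idx) for name, idx in columns.items()}
best = max(reductions, key=reductions.get)

for name, value in reductions.items():
    print(f"{name}: reduction in standard deviation = {value:,.0f}")
print("best split:", best)
```

In this made-up data, savings differ far more by employment than by gender, so splitting on `employed` leaves each group much more homogeneous and wins the comparison.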
To terminate the splitting process, the coefficient of variation (CV, the standard deviation divided by the mean) can be used as a ‘stop’ signal:
- If the CV of every possible split is below a chosen threshold, we can stop the splitting process.
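A minimal sketch of such a stop rule is shown below, assuming CV means the coefficient of variation expressed as a percentage. The 10% threshold, the function names, and the sample branches are hypothetical, and the code assumes every branch has a positive mean.

```python
import statistics

def coefficient_of_variation(values):
    """CV as a percentage: standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

def should_stop(branches, cv_threshold=10.0):
    """Stop splitting when every candidate branch has a CV below the threshold."""
    return all(coefficient_of_variation(b) < cv_threshold for b in branches)

# Hypothetical branches of savings amounts in dollars.
tight = [[40_000, 42_000, 38_000], [2_000, 2_100, 1_900]]
loose = [[40_000, 2_000, 38_000], [1_000, 45_000, 3_000]]

print(should_stop(tight))  # each branch is already homogeneous
print(should_stop(loose))  # branches still mix very different savers
```

Because the CV is relative to the mean, the same threshold works for branches of small savers and large savers alike.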