What is Machine Learning? Machine learning is the process of letting a machine use data to learn the relationship between predictor variables and a target variable. It is one of the first steps toward becoming a data scientist.
There are two main kinds of machine learning: supervised and unsupervised learning. Supervised learning has two types: regression and classification. In this blog post, I will focus on regression models.
In weeks 3 and 4 of my General Assembly Data Science Immersive program, we learned about the sklearn library and applied machine learning to regression problems. Most machine learning models cannot take string data as input. So, to deal with data that are not numeric, we use feature engineering or create dummy variables. With feature engineering, we can convert an object (string) column into numerical values. Creating dummy variables turns a categorical column into binary columns of 1s and 0s, one for each category.
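To make the dummy-variable idea concrete, here is a minimal sketch using pandas. The toy DataFrame and its column names (`neighborhood`, `sale_price`) are made up for illustration and are not the project's actual columns.

```python
import pandas as pd

# Toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "neighborhood": ["NAmes", "Edwards", "NAmes", "Gilbert"],
    "sale_price": [150000, 120000, 165000, 190000],
})

# Create one binary column per category. drop_first=True drops the first
# category (alphabetically) so the remaining columns are not redundant.
dummies = pd.get_dummies(df, columns=["neighborhood"], drop_first=True, dtype=int)
print(dummies)
```

Dropping the first category is a common choice for linear models, since the dropped category becomes the reference level that the other columns are compared against.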
To showcase the new skills we learned from the sklearn library, our class had a little Kaggle competition to see who had the best model. The project is:
AMES HOUSING DATA SALE PRICE PREDICTION:
My project’s problem statement was:
“A realtor is looking to renovate and build houses in Ames, Iowa. They want me to look at the data to see what to invest in to get the best R.O.I. Which features will raise the price of the house value?”
The data is from Kaggle (‘https://www.kaggle.com/c/dsir-28-project-2-regression-challenge/data’). The train data contains all of the data that I would use to create my model. The test data is the data that I want my model to predict on.
Different types of data:
Nominal: used for labeling variables with no inherent order (m - male and f - female)
Ordinal: used for non-numeric values that have a meaningful order (1 - unhappy, 2 - ok, 3 - happy)
Data Cleaning: This data set has 2051 rows and 80 columns, and there were a lot of missing and null values to clean. Using the data dictionary provided on the Kaggle website as a guide, I cleaned the data with the pandas library. For discrete or ordinal features, I filled the null values with the mode so that I would not introduce a non-integer value. For continuous columns, I imputed the null values with the column mean.
Exploratory Data Analysis: The feature most strongly correlated with the sale price was the overall quality of the house.
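The imputation rule described above can be sketched roughly like this. The two columns are hypothetical stand-ins for one discrete and one continuous feature, not the exact columns I cleaned.

```python
import numpy as np
import pandas as pd

# Toy data with missing values; column names are illustrative assumptions.
df = pd.DataFrame({
    "garage_cars": [2, 1, np.nan, 2, 0],                  # discrete feature
    "lot_frontage": [60.0, np.nan, 80.0, 70.0, np.nan],   # continuous feature
})

# Discrete/ordinal column: fill with the mode so the value stays a whole number.
df["garage_cars"] = df["garage_cars"].fillna(df["garage_cars"].mode()[0])

# Continuous column: fill with the column mean.
df["lot_frontage"] = df["lot_frontage"].fillna(df["lot_frontage"].mean())

print(df)
```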
Features that have the best correlation to the Sale Price
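One way to get a ranking like this is to compute each feature's correlation with the target and sort the result. This is a sketch on a small made-up DataFrame, so the column names and numbers are assumptions rather than values from the Ames data.

```python
import pandas as pd

# Toy numeric data; names and values are illustrative assumptions.
df = pd.DataFrame({
    "overall_qual": [5, 6, 7, 8, 9, 4],
    "gr_liv_area": [1200, 1500, 1400, 2000, 2400, 1000],
    "year_built": [1960, 1995, 1975, 2005, 2010, 1950],
    "sale_price": [130000, 160000, 185000, 240000, 310000, 105000],
})

# Pearson correlation of every feature with the target, strongest first.
corr = (
    df.corr()["sale_price"]
    .drop("sale_price")
    .sort_values(ascending=False)
)
print(corr)
```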
FEATURE ENGINEERING: To predict the outcome variable (the y variable, here the Sale Price), all of the object-type columns need to be turned into numeric columns. So first, I converted all the ordinal columns into numeric ones by assigning each level a number, and for all the other categorical features I created dummy variables.
MODELING: After creating all the features to predict the sale prices, I used sklearn to split the training data into a train set and a test set. In regression, there are a lot of models you can choose from to find the best performing one. In this project, I ran Linear Regression, Ridge, and Lasso models.
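Converting an ordinal column can be as simple as mapping each level to a number. The sketch below assumes a quality column that uses the Po/Fa/TA/Gd/Ex scale, and both the column name `kitchen_qual` and the 1 to 5 numbering are illustrative choices.

```python
import pandas as pd

# Toy ordinal column; the name and values are illustrative assumptions.
df = pd.DataFrame({"kitchen_qual": ["TA", "Gd", "Ex", "Fa", "Gd"]})

# Assumed ordering from worst to best: Poor, Fair, Typical/Average, Good, Excellent.
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Replace each label with its rank so the model sees the order.
df["kitchen_qual_num"] = df["kitchen_qual"].map(quality_scale)
print(df)
```

Any label that is missing from the dictionary would come out as NaN, so it is worth checking for nulls after mapping.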
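The modeling step could look roughly like the sketch below. It runs on synthetic data instead of the housing data, and the scaling step, the `alpha` values, and the `random_state` are my assumptions for illustration, not the settings from the project.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered features (X) and sale price (y).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=200)

# Hold out part of the training data to check how the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Ridge and Lasso penalize coefficient size, so features are scaled first.
# The scaler is fit on the train split only to avoid leaking test information.
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # alpha values are arbitrary examples
    "lasso": Lasso(alpha=0.1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_sc, y_train)
    # .score() returns R^2 for sklearn regressors.
    scores[name] = (
        model.score(X_train_sc, y_train),
        model.score(X_test_sc, y_test),
    )
    print(f"{name}: train R2={scores[name][0]:.3f}, test R2={scores[name][1]:.3f}")
```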
Training Score: How the model fits the training data
Testing Score: How the model generalizes to new data
RMSE: Root mean squared error, the square root of the average squared difference between the predicted and true values. It shows how far predictions typically fall from the true values, in the same units as the target (dollars here).
The baseline model had an RMSE of $80,000. The baseline gives a reference RMSE to compare my models' RMSE against, which shows how effective the models are. All of the models performed far better than the baseline.
Lasso performed better than Linear Regression or Ridge, with an R² score of 0.92 on the test set and an RMSE of $21,728. I chose Lasso because the model had a lot of features, and Lasso shrinks the coefficients and can set some of them to exactly zero, which reduces variance without a substantial increase in bias.
This shows the relationship between the predicted and actual sale prices; as you can see, it is pretty linear. The model performs well because the predicted sale prices are close to the actual ones.
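As a quick illustration of the RMSE definition, here is a small helper written with NumPy. The function name `rmse` and the three example prices are made up for this sketch.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices: two predictions are off by $10,000 and one is exact.
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 300000]
print(round(rmse(actual, predicted), 2))  # about 8164.97
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.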
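A common way to build a baseline for regression is to predict the mean of the training target for every house. The sketch below assumes that kind of baseline (using sklearn's `DummyRegressor`) and uses made-up prices, so the number it prints is not the $80,000 from my project.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up sale prices; the features are ignored by a mean baseline.
y_train = np.array([100000.0, 200000.0, 300000.0])
y_test = np.array([150000.0, 250000.0])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predicts the training mean, here 200000.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_test)

baseline_rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
print(baseline_rmse)  # 50000.0
```

Any real model should beat this number; if it does not, the features are not adding information.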
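The way Lasso removes features can be seen in a small experiment. This sketch uses synthetic data where only the first three of ten features matter, and `alpha=0.5` is an arbitrary value chosen for the demo, not the one from my model.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, but only the first 3 affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
true_coef = np.array([5.0, -3.0, 2.0] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=1.0, size=300)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ordinary least squares keeps small nonzero weights on the noise features,
# while the L1 penalty in Lasso pushes them to exactly zero.
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("zero coefficients, linear regression:", n_zero_ols)
print("zero coefficients, lasso:", n_zero_lasso)
```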
Below is the GitHub link for the project if you want to check it out.