When studying Data Science, we learn that the results of our Machine Learning models are highly influenced by the quantity and quality of our data. This means we have to take good care of our datasets: dealing with outliers and missing values, and performing other preprocessing steps.
The first step in any Data Science project is defining and understanding the problem, so you can come up with the best possible solution to it. With that in mind, you can choose the model best suited to the project. This is extremely important, because each model works differently and is more or less sensitive to different aspects of the dataset.
First, you have to know whether your model of choice is Tree-Based or not, meaning it is built on Decision Trees (Random Forests and Gradient Boosting, for example). Because trees split on feature thresholds, they don’t depend on scaling and don’t show substantial improvement when working on preprocessed data.
However, when it comes to Non-Tree-Based models, such as Linear models, KNN, and Neural Networks, preprocessing your data can bring great improvements to your model’s results.
But what techniques can we apply? Let’s talk about some of the most used ones.
MinMaxScaler from Sklearn is a preprocessing technique that puts the data on the same scale. To quote its documentation:
It transforms features by scaling each feature to a given range.
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
For better visualization, imagine you have a dataset of Real Estate Listings from Los Angeles. It’d have prices ranging from a couple hundred thousand to millions of dollars. This discrepancy can make the model’s job harder, and putting everything on the same scale (e.g. from 0 to 1) will make your model work the data in a better way, yielding better results.
Note that the scaler is fit on the training set and then used to transform all the data, so the algorithm runs entirely on scaled values.
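To make this concrete, here’s a minimal sketch of MinMaxScaler in action; the listing prices below are made-up values for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical listing prices in dollars (made-up values for illustration)
prices = np.array([[350_000], [780_000], [1_200_000], [4_500_000]])

# Fit on the training data and transform it to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(prices)

print(scaled.ravel())  # [0.     0.1036 0.2048 1.    ] (rounded)
```

Notice how the multi-million-dollar outlier still dominates the scaled range, squeezing the other listings toward zero. That is exactly the weakness the Rank method below addresses.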
StandardScaler, just like MinMaxScaler, transforms the data to put it on the same scale. However, it works a little differently. The documentation explains:
Standardize features by removing the mean and scaling to unit variance.
This means that for a given feature, the Mean would be 0, and the Standard Deviation would be 1. This standardizes the features, making them more manageable for our models. Again, Non-Tree-Based models benefit the most from this kind of standardization.
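Here’s the same kind of minimal sketch for StandardScaler, again on made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature values for illustration
X = np.array([[2.0], [4.0], [6.0], [8.0]])

# Remove the mean and scale to unit variance
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.ravel())              # [-1.3416 -0.4472  0.4472  1.3416] (rounded)
print(X_std.mean(), X_std.std())  # 0.0 and 1.0
```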
The rankdata method from the SciPy library replaces values with their ranks, making the spacing between sorted values equal. To quote the documentation:
It assigns ranks to data, dealing with ties appropriately.
Ranks begin at 1. The method argument controls how ranks are assigned to equal values.
If you haven’t dealt with outliers so far, Rank works better than MinMaxScaler, since it moves outliers closer to the other objects. Again, it works great for Linear models, KNN, and Neural Networks.
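As a small sketch of how that works, here is scipy.stats.rankdata applied to a made-up array with one extreme outlier:

```python
import numpy as np
from scipy.stats import rankdata

# Made-up values with one extreme outlier
x = np.array([10, 12, 11, 10_000])

# Replace each value with its rank; ties are averaged by default
ranks = rankdata(x, method='average')
print(ranks)  # [1. 3. 2. 4.] -- the outlier is now just one rank step away
```

One practical caveat: ranks computed on the training set don’t extend to unseen test values, so a common trick is to rank the concatenated train and test data together.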
In their course, the instructors from the Higher School of Economics in Russia also mention some good practices that might help the models even further:
Train the model on concatenated data frames produced by different preprocessing techniques. For example, use MinMaxScaler, StandardScaler, and Rank to preprocess your data separately, then concatenate the results and train your model on this wider data frame (see the sketch after this list).
Ensemble a mix of models trained using different types of preprocessing, in order to take advantage of all the pros each technique brings to the table.
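As a rough sketch of the first practice (the column names and toy data frame below are made up), the differently preprocessed copies can be stacked side by side before training:

```python
import pandas as pd
from scipy.stats import rankdata
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up numeric data frame for illustration
df = pd.DataFrame({'price': [350_000, 780_000, 1_200_000],
                   'sqft':  [900, 1_400, 2_100]})

# Preprocess the same features in three different ways
minmax = pd.DataFrame(MinMaxScaler().fit_transform(df),
                      columns=df.columns).add_suffix('_minmax')
standard = pd.DataFrame(StandardScaler().fit_transform(df),
                        columns=df.columns).add_suffix('_std')
ranked = df.apply(rankdata).add_suffix('_rank')

# Concatenate the results and train your model on this wider frame
X = pd.concat([minmax, standard, ranked], axis=1)
print(X.shape)  # (3, 6)
```

For the second practice, you would instead train one model per preprocessed frame and average (or otherwise combine) their predictions.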
In a nutshell, these techniques should help you improve the performance of your models. The goal of this article was to present a refresher and a general guideline on where to start when preprocessing your data. You can dive deeper into this subject in this great article by Towards Data Science.
Be on the lookout for the next article, where we’ll tackle preprocessing techniques for Categorical Data. Also, make sure to connect with me on LinkedIn, and check out my projects on GitHub.