Data Preprocessing: A Basic Guideline

February 13, 2020
in Neural Networks

When studying Data Science, we learn that the results of our Machine Learning models are highly influenced by the quantity and quality of our data. This means we have to make sure we take good care of our datasets, dealing with outliers and missing values, as well as other preprocessing steps.

In this article, I’d like to highlight some key ideas and topics brought up by the instructors of the Coursera course “How To Win A Data Science Competition”. I’ve been studying this course and found these tips extremely valuable. In this first article, we’ll focus on preprocessing Numeric Features. We’ll deal with Categorical Features later on.

The first step in any Data Science project is defining and understanding the problem, so you can come up with the best possible solution to it. With that in mind, you can choose the best model to be used in this project. This is extremely important, because each model works differently, and is more or less sensitive to different aspects of the dataset.

First, you have to know whether your model of choice is Tree-Based or not, that is, whether it is built on Decision Trees. Decision Trees split on thresholds of individual features, so they don’t depend on scaling and don’t show substantial improvement when working on preprocessed data.

However, when it comes to Non-Tree-Based models, such as Linear models, KNN and Neural Networks, preprocessing your data can bring great improvements to your model’s results, because these models are sensitive to the magnitude of each feature.

But what techniques can we apply? Let’s talk about some of the most used ones.

MinMaxScaler from Sklearn is a preprocessing technique that puts the data on the same scale. To quote its documentation:

It transforms features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

To make this concrete, imagine you have a dataset with Real Estate Listings from Los Angeles. Its prices would range from a couple hundred thousand to millions of dollars, while other features, such as the number of bedrooms, stay in the single digits. A feature with such large values can dominate the others in models that rely on distances or weighted sums. Putting everything on the same scale (e.g. from 0 to 1) lets your model treat the features more evenly, which often yields better results.

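Of course, the scaler learns its range from the training set, and the same transformation then has to be applied to any other data the model sees, so that the algorithm always runs on data scaled in the same way.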
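A minimal sketch of this with `MinMaxScaler` follows. The prices are hypothetical values picked for illustration, not real listings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical listing prices in dollars (a single feature, so a single column).
train_prices = np.array([[250_000.0], [400_000.0], [1_000_000.0], [3_250_000.0]])
test_prices = np.array([[550_000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))

# Learn the minimum and maximum from the training set only, then scale it.
train_scaled = scaler.fit_transform(train_prices)

# Reuse the training minimum and maximum on unseen data.
test_scaled = scaler.transform(test_prices)

print(train_scaled.ravel())  # 0.0, 0.05, 0.25, 1.0
print(test_scaled.ravel())   # 0.1
```

Note that a test value outside the training range would land outside [0, 1], since the scaler only knows the minimum and maximum it saw during `fit`.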

StandardScaler, just like MinMaxScaler, transforms the data to put it on the same scale. However, it works a little differently. To quote its documentation:

Standardize features by removing the mean and scaling to unit variance.

This means that, after the transformation, each feature has a mean of 0 and a standard deviation of 1 on the training set. Unlike MinMaxScaler, the result is not bounded to a fixed range. This standardizes the features, making them more manageable for our models. Again, Non-Tree-Based models benefit the most from this kind of standardization.
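As a quick check of that claim, here is a small sketch with `StandardScaler` on a made-up two-feature array (the numbers are arbitrary).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: two features on very different scales.
X_train = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

scaler = StandardScaler()
X_std = scaler.fit_transform(X_train)

# Per-feature statistics after the transformation.
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)

print(scaler.mean_)  # per-feature means learned from the training data
print(col_means)     # approximately 0 for every feature
print(col_stds)      # approximately 1 for every feature
```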

The Rank transformation, available as the rankdata function in the Scipy Library, replaces each value with its position in the sorted data, so the spaces between sorted values become equal. To quote its documentation:

It assigns ranks to data, dealing with ties appropriately.

Ranks begin at 1. The method argument controls how ranks are assigned to equal values.

If you haven’t dealt with outliers so far, Rank can work better than MinMaxScaler, since it moves outliers closer to the other objects instead of letting them stretch the scale. Again, it is mostly useful for Linear models, KNN and Neural Networks.
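The effect on outliers can be seen in a short sketch with `scipy.stats.rankdata`. The values are made up, with one deliberately extreme entry, and the min-max scaling is computed by hand for comparison.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical feature with a tie (20, 20) and one extreme outlier.
values = np.array([10.0, 20.0, 20.0, 30.0, 10_000.0])

# Tied values share the average of the ranks they would occupy.
ranks = rankdata(values, method="average")

# Min-max scaling of the same values, computed by hand for comparison.
minmax = (values - values.min()) / (values.max() - values.min())

print(ranks)   # 1.0, 2.5, 2.5, 4.0, 5.0
print(minmax)  # everything except the outlier is squeezed close to 0
```

After ranking, the outlier sits just one step above its neighbor, while min-max scaling leaves the other four values packed below 0.01. One thing to keep in mind: `rankdata` has no separate fit and transform steps, so how to rank unseen test data is a decision you have to make yourself.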

In their course, the instructors from the Higher School of Economics in Russia also mention some good practices that might help the models even further:

  • Train the model on concatenated data frames produced by different preprocessing techniques. For example, apply MinMaxScaler, StandardScaler and Rank to your data separately, each on its own copy of the original features. Then, concatenate the results column-wise and train your model on this data frame.
  • Ensemble a mix of models trained using different types of preprocessing, in order to take advantage of all the pros each technique brings to the table.
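The first practice above could look roughly like the following sketch. The feature matrix is hypothetical, and ranking each column with `np.apply_along_axis` is just one way to do it.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric features: [price in dollars, number of bedrooms].
X = np.array([
    [250_000.0, 2.0],
    [400_000.0, 3.0],
    [1_000_000.0, 3.0],
    [3_250_000.0, 6.0],
])

# Each technique is applied separately to the original features.
X_minmax = MinMaxScaler().fit_transform(X)
X_std = StandardScaler().fit_transform(X)
X_rank = np.apply_along_axis(rankdata, 0, X)  # rank every column on its own

# Concatenate column-wise: 2 original features become 6 model inputs.
X_combined = np.hstack([X_minmax, X_std, X_rank])

print(X_combined.shape)  # (4, 6)
```

`X_combined` can then be passed to the `fit` method of a Non-Tree-Based model. For the second practice, you would instead train one model per preprocessed matrix and combine their predictions.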

In a nutshell, these techniques should help you improve the performance of your Non-Tree-Based models. The goal of this article was to present a refresher or a general guideline on where to start when preprocessing your data. You can dive deeper into this subject in this great article by Towards Data Science.

Be on the lookout for the next article, where we’ll tackle preprocessing techniques for Categorical Data. Also, make sure to connect with me on LinkedIn, and check out my projects on GitHub.

Credit: BecomingHuman By: Rafael Duarte
