Wednesday, March 3, 2021
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News
NikolaNews

What does the Machine Learning process look like?

August 5, 2019
in Data Science

This is the second article about Machine Learning. If you would like to start from the introduction, you can find the first article here.

I have mentioned that every Machine Learning process is built from several steps:

  • what would you like to achieve (define the goal)
  • prepare the data
  • select an algorithm(s)
  • build and train the model
  • test the model (and score it)

Let’s review them one by one. I should mention that other sources on the internet list a different number of steps than you see on this blog. For example, you can separate building and testing a model, but in the end you need to do both, no matter whether you count them as one step or more. The same can be said about testing and evaluating your model.

What would you like to achieve (define the goal)

Well. This is the most important part of the process! At least from the business (or problem solving) point of view. Please do not be angry that I am using the word “business”: in the end, someone has to pay for your work as a data scientist.

IMPORTANT DISCLAIMER – NOT ALL PROBLEMS WILL AND SHOULD BE SOLVED BY MACHINE LEARNING!

Think about it. Imagine you are a car driver and see a traffic light. There are three colors – red, yellow and green. When the light is red you should wait. When the light is yellow you should be careful and not enter the crossroads. When the light is green you can drive (safely). There is a fourth state of the light as well – the red and yellow lights are on together, which means that green will come on in a moment. The yellow light can also flash, which is an error state. You also know the sequence: green -> yellow -> red -> red and yellow -> green…

Now try to implement a system that, knowing the light color, tells you whether you can go, prepare to go or wait. Do you need to create a Machine Learning model? Or maybe a neural network? No – the system is just a pretty simple algorithm based on a few rules.
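
As a minimal sketch of such a rule-based system, the traffic light fits in a lookup table. The state names, the action labels and the `decide` function are assumptions made up for this illustration.

```python
# Rule-based traffic light logic: a plain lookup table, no model required.
# State names and action labels are illustrative assumptions.
RULES = {
    frozenset({"red"}): "wait",
    frozenset({"red", "yellow"}): "prepare to go",
    frozenset({"green"}): "go",
    frozenset({"yellow"}): "do not enter the crossroads",
}


def decide(lights_on, yellow_flashing=False):
    """Return the driver's action for the set of lights currently on."""
    if yellow_flashing:
        # A flashing yellow light is the error state described in the text.
        return "error state"
    return RULES.get(frozenset(lights_on), "unknown state")


print(decide({"red"}))
print(decide({"red", "yellow"}))
print(decide({"green"}))
print(decide({"yellow"}, yellow_flashing=True))
```

Every possible input is covered by an explicit rule, so there is nothing left for a model to learn.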

Now you know – you would like to solve some problem. The better it is defined, the greater the chances of success. It can be a simple question like:

  • Based on my current medical examination results will I live over 100 years with 95% probability?
  • Is this mushroom edible or poisonous?
  • Is this email spam?
  • How much is my car worth?

The problem can, however, be more complicated. Let’s look at the picture below. Guess which one is a chihuahua and which one is a blueberry muffin.

We as humans can see the difference, but for a program this can be tricky and extremely challenging. What about a task like: “Is this body cell malignant or healthy?” Machine Learning helps here.

Prepare the data

Wait! What data? Do I have the data already? You should! Someone has already defined the problem, and based on that definition the data sets should have been identified.

You can have files, relational databases, NoSQL databases, graph data… whatever it can be!

Believe it or not, this is where the problems really start! The first question should be – what is the data quality? I have data from my company – can I trust the data?

What about public data sources like the ones you can find on the internet? Take a look at the one-minute video I made for you. It is all about public data.

I have not shuffled the deck. The cards were there all the time. It was just as you saw it. Sometimes you need good data and think that a public data set can provide it. But with public data you can easily get cheated, and then the quality of the entire process will be very low. Not necessarily, but… you know. It can be.

You have successfully gathered the data and need to do some preparation now. I will post not one but many articles about the techniques of data preparation. There are lots of methods here, but first you should know your data set: what is its origin, what information does it contain, and which attributes are important? If you do not know the data set very well, how can you get to know it better (how to perform exploratory analysis)? Can we reduce the number of attributes (PCA)? Can we remove some data without creating a data skew? Can we introduce new features by combining the existing ones? How do we map string data to numerical data (One Hot Encoding)? Should we perform some normalization or standardization of the data?
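
Three of the techniques named above can be sketched with scikit-learn. This is only an illustration under stated assumptions: the toy values are made up, and the choice of `OneHotEncoder`, `StandardScaler` and `PCA` is one possible way to do it, not a recipe from this series.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy data: one string attribute and three numeric attributes.
colors = np.array([["white"], ["black"], ["white"], ["red"]])
numeric = np.array([
    [3.0, 60000.0, 2.0],
    [5.0, 90000.0, 1.6],
    [1.0, 15000.0, 2.0],
    [8.0, 150000.0, 1.4],
])

# One Hot Encoding: each distinct string becomes its own 0/1 column.
encoder = OneHotEncoder()
colors_encoded = encoder.fit_transform(colors).toarray()

# Standardization: rescale each numeric column to zero mean and unit variance.
numeric_scaled = StandardScaler().fit_transform(numeric)

# PCA: project the three numeric attributes onto two principal components.
numeric_reduced = PCA(n_components=2).fit_transform(numeric_scaled)

print(encoder.categories_[0])   # the distinct color values that were found
print(colors_encoded.shape)     # one column per distinct color
print(numeric_reduced.shape)    # two columns left after PCA
```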

Oh boy, so many topics to cover!

Select an algorithm(s)

Based on the question you have been given (the goal), you should consider not one but several algorithms. There are dozens of them, so how do you pick a good set of algorithms? The simplest approach is to know whether you are dealing with classification or regression.

Classification is when you assign the output to one of several groups. In our example, the email can be spam or not spam. The classification process takes into account all input features and decides whether a new email (never seen before) is spam or not.

A regression algorithm predicts (estimates, or guesstimates if you will) a number based on the values of the input features. For example: how much will my car be worth next year if it is now 3 years old, has a 6.8-liter diesel engine and is white (and many more…)?
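
The difference can be seen in a few lines of scikit-learn. This is a sketch on made-up toy data: the email features (number of links, number of times the word “free” appears) and the car prices are invented for the illustration, and the two algorithms were picked only because they are short to write.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Classification: made-up emails described by
# [number of links, number of "free" mentions].
emails = [[0, 0], [1, 0], [7, 5], [9, 8]]
labels = ["not spam", "not spam", "spam", "spam"]
classifier = DecisionTreeClassifier(random_state=0).fit(emails, labels)
predicted_label = classifier.predict([[8, 6]])[0]
print(predicted_label)   # the output is one of the known groups

# Regression: made-up car prices that drop by 3000 per year of age.
ages = [[1], [2], [3], [5]]
prices = [27000, 24000, 21000, 15000]
regressor = LinearRegression().fit(ages, prices)
predicted_price = regressor.predict([[4]])[0]
print(round(predicted_price))   # the output is a number
```

The classifier can only answer with a label it saw during training, while the regressor returns a value on a continuous scale.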

Let me name some classification algorithms here so we can play with them later:

  • Naive Bayes
  • k-Nearest Neighbors
  • SVM – Support Vector Machines
  • Decision Trees
  • Random Forest
  • Logistic Regression (yes, it is a classification algorithm)
  • Neural Network – wait! Is it an algorithm?

Here you are – some regression algorithms:

  • Linear Regression Model
  • Lasso Regression
  • Ridge Regression
  • Polynomial Regression
  • ElasticNet Regression

But how to pick the right one?
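
One common way to approach this question is to try several candidates on the same data and compare their cross-validated scores. The sketch below assumes scikit-learn and its bundled breast cancer data set (chosen here because it matches the “malignant or healthy” question). It is one possible approach, not necessarily the one a later article will take.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate classifiers from the list above. Distance-based and linear
# models get a scaling step so that all features are comparable.
candidates = {
    "Naive Bayes": GaussianNB(),
    "k-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

results = {}
for name, model in candidates.items():
    # Mean accuracy over 5 folds; each fold is held out once for scoring.
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")

best = max(results, key=results.get)
print("Best candidate on this data set:", best)
```

The ranking only holds for this data set and this metric, so a different problem can easily favor a different algorithm.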

Build and train the model

You have a data set that contains input features and the information about an outcome (an output feature). Having this in mind let’s build a model.

To do so, you need to split your data set into two parts, called the training and testing data sets. Typically the training data set contains 70% of your data and the testing set holds the rest. Of course it is not always like this: you can assign a smaller share to the training data set, especially if you have a lot of data.

Now you ask yourself: how do I divide the data set correctly? Not in terms of numbers (a 70%-30% split) but in terms of data quality. The good thing is that existing frameworks like scikit-learn help us in many aspects. I will concentrate on this part in a later post.
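
A minimal sketch of a 70%-30% split, assuming scikit-learn's `train_test_split` and the bundled Iris data set (chosen only for illustration). The `stratify` argument is one way to address the quality side of the split: it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# 150 flowers, 50 from each of 3 classes.
X, y = load_iris(return_X_y=True)

# 70% for training, 30% for testing; stratify=y keeps the class balance,
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(X_train), len(X_test))   # 105 45
print(np.bincount(y_test))         # [15 15 15]
```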

Once you have the training and testing data sets, you can choose the model you would like to build. This is really the easy part once you have chosen an algorithm and you know a framework like scikit-learn a bit.

Building a model means creating an object of a specific type and feeding it with data from the training data set. Sometimes a model is trained just once, and sometimes it is trained iteratively, as in the k-fold cross-validation process (more on this later).
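
Both variants can be sketched in scikit-learn. The sketch assumes the Iris data set and a k-Nearest Neighbors classifier, both picked only as examples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Trained once: create the model object and feed it the training data.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Trained iteratively: 5-fold cross-validation fits a fresh copy of the
# model 5 times, each time holding out a different fifth of the training
# data for scoring.
fold_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5
)
print(fold_scores)
print(fold_scores.mean())
```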

Test the model (and score it)

Once the model has been trained, you need to test it on data that the model has never seen. It is analogous to an exam. You prepare for an exam by studying books or doing research. Then you take the exam and your knowledge is tested. The result of the test shows how good your knowledge is: the higher the score, the better an expert you are. But if your score is not so good, you need to learn more or change your approach.

You should apply the same process to your model. You need to evaluate it and see whether it can really answer the question you posed in the initial step of this process.
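
A minimal sketch of the “exam”, assuming scikit-learn, the Iris data set and accuracy as the score. These are illustrative choices, since the right metric depends on the question you are trying to answer.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# The exam: predict on data the model has never seen during training.
predictions = model.predict(X_test)

# The score: the fraction of test samples that were classified correctly.
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.3f}")

# Rows are the true classes, columns the predicted classes.
matrix = confusion_matrix(y_test, predictions)
print(matrix)
```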

But what if the model is not working as expected? Then you have two options:

  • run the learning process once again on the same model but tune its parameters (this is called hyperparameter tuning)
  • change the model and find a better one
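
Both options can be sketched with scikit-learn. The data set, the parameter grid and the Random Forest alternative below are assumptions chosen for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Option 1: keep the model, tune its hyperparameters.
# GridSearchCV tries every value in the grid with 5-fold cross-validation
# and refits the best setting on the whole training set.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
tuned_score = search.score(X_test, y_test)
print(search.best_params_, tuned_score)

# Option 2: change the model and compare on the same test set.
alternative = RandomForestClassifier(n_estimators=100, random_state=42)
alternative.fit(X_train, y_train)
alternative_score = alternative.score(X_test, y_test)
print(alternative_score)
```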

In one of the next articles I will show you how to automate this process in Python.

What’s next?

Are you overwhelmed by this post? I will explain all the main steps in the next articles, so do not worry! You can relax, as we will be using existing frameworks that speed up the Machine Learning and AI steps I have described.

There will be even more new tasks to cover, so please stay tuned. For example, apart from the many things I have mentioned above, I will be discussing:

  • automated Machine Learning in the Microsoft Azure Cloud
  • manual deployments
  • consuming Machine Learning models in the apps

Cheers,
Damian

Originally posted here


Credit: Data Science Central By: Damian Widera


NikolaNews.com is an online News Portal which aims to share news about blockchain, AI, Big Data, and Data Privacy and more!


© 2019 NikolaNews.com - Global Tech Updates
