
Cross-Validation: Concept and Example in R

February 19, 2019
in Data Science

Credit: Data Science Central

This article was written by Sondos Atwi.

In Machine Learning, cross-validation is a resampling method used for model evaluation that avoids testing a model on the same dataset on which it was trained. Testing on the training data is a common mistake, especially when a separate testing dataset is not available, but it usually leads to inaccurate performance measures (the model achieves an almost perfect score because it is evaluated on the same data it was fit to). To avoid this kind of mistake, cross-validation is usually preferred.

The concept of cross-validation is actually simple: instead of using the whole dataset to train and then testing on that same data, we randomly divide our data into training and testing datasets.

There are several cross-validation methods (LOOCV – leave-one-out cross-validation, the holdout method, k-fold cross-validation). Here, I am going to discuss the k-fold cross-validation method.
K-fold cross-validation basically consists of the steps below (a minimal R sketch follows the list):

  1. Randomly split the data into k subsets, also called folds.
  2. Fit the model on the training data, i.e. on k-1 of the folds.
  3. Use the remaining fold as a test set to validate the model (usually, the accuracy or test error of the model is measured in this step).
  4. Repeat the procedure k times, holding out a different fold each time.
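
As a rough, hedged illustration of these four steps in R (the data frame df, its binary outcome column y, and the logistic regression are placeholders; caret's createFolds is assumed to be available):

    library(caret)                                 # for createFolds

    k     <- 5
    folds <- createFolds(df$y, k = k)              # list of k sets of held-out row indices

    accuracies <- sapply(folds, function(test_idx) {
      train <- df[-test_idx, ]                     # k-1 folds for training
      test  <- df[test_idx, ]                      # the remaining fold for testing
      fit   <- glm(y ~ ., data = train, family = binomial)
      probs <- predict(fit, newdata = test, type = "response")
      preds <- ifelse(probs > 0.5, 1, 0)           # default 0.5 cutoff
      mean(preds == test$y)                        # accuracy on this fold
    })

    mean(accuracies)                               # cross-validated accuracy estimate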

How can it be done with R?

In the exercise below, I am using logistic regression to predict whether a passenger in the famous Titanic dataset survived or not. The purpose is to find an optimal threshold on the predicted probabilities that decides whether a prediction is classified as 1 (survived) or 0 (did not survive).

Threshold example: Suppose the model has predicted the following values for two passengers: p1 = 0.7 and p2 = 0.4. If the threshold is 0.5, then p1 > threshold, so passenger 1 falls in the ‘survived’ category, whereas p2 < threshold, so passenger 2 falls in the ‘not survived’ category.
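
In R, applying such a cutoff to predicted probabilities is a one-liner; for the two passengers above:

    probs     <- c(p1 = 0.7, p2 = 0.4)    # predicted probabilities
    threshold <- 0.5
    ifelse(probs > threshold, "survived", "not survived")
    # p1 -> "survived", p2 -> "not survived"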

However, depending on our data, the ‘default’ threshold of 0.5 will not always maximize the number of correct classifications. In this context, we can use cross-validation to determine the best threshold for each fold, based on the results of running the model on the validation set.

In my implementation, I followed the steps below (a sketch of the initial split follows the list):

  1. Split the data randomly into 80% (training and validation) and 20% (testing on unseen data).
  2. Run cross-validation on the 80% portion, which is used to train and validate the model.
  3. Get the optimal threshold after running the model on the validation dataset, according to the best accuracy at each fold iteration.
  4. Store the best accuracy and the optimal threshold resulting from the fold iterations in a dataframe.
  5. Find the best threshold (the one with the highest accuracy) and use it as the cutoff when testing the model against the test dataset.
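
Step 1 could look like this (a sketch only; titanic is assumed to be a data frame that already holds the Titanic data):

    set.seed(123)                                      # for reproducibility
    n        <- nrow(titanic)
    train_id <- sample(seq_len(n), size = round(0.8 * n))
    trainval <- titanic[train_id, ]                    # 80%: train + validation (cross-validation)
    test     <- titanic[-train_id, ]                   # 20%: held-out, unseen test data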

Note: ROC analysis is usually the best way to find an optimal ‘cutoff’ probability, but for the sake of simplicity, I am using accuracy in the code below.

The cross_validation method will (a hedged R sketch of it follows the list):

  1. Create a ‘perf’ dataframe that will store the results of testing the model on the validation data.
  2. Use the createFolds method to create nbfolds folds.
  3. On each of the folds:
    • Train the model on the other k-1 folds.
    • Test the model on the remaining fold.
    • Measure the accuracy of the model using the performance method.
    • Add the optimal threshold and its accuracy to the perf dataframe.
  4. Look in the perf dataframe for optThresh – the threshold with the highest accuracy.
  5. Use it as the cutoff when testing the model on the test set (the 20% of the original data).
  6. Use the F1 score to measure the accuracy of the model on that test set.
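
The full source code is linked at the end of this post; the following is only a hedged sketch of what such a cross_validation function could look like, assuming caret for createFolds, ROCR for prediction/performance, a data frame with a binary Survived column, and a placeholder model formula:

    library(caret)   # createFolds
    library(ROCR)    # prediction, performance

    cross_validation <- function(data, nbfolds = 10) {
      perf  <- data.frame(threshold = numeric(0), accuracy = numeric(0))
      folds <- createFolds(data$Survived, k = nbfolds)

      for (test_idx in folds) {
        train <- data[-test_idx, ]                       # k-1 folds
        valid <- data[test_idx, ]                        # remaining fold

        fit   <- glm(Survived ~ Pclass + Sex,            # placeholder formula
                     data = train, family = binomial)
        probs <- predict(fit, newdata = valid, type = "response")

        pred <- prediction(probs, valid$Survived)
        acc  <- performance(pred, "acc")                 # accuracy at every possible cutoff
        best <- which.max(acc@y.values[[1]])

        perf <- rbind(perf,
                      data.frame(threshold = acc@x.values[[1]][best],
                                 accuracy  = acc@y.values[[1]][best]))
      }

      perf$threshold[which.max(perf$accuracy)]           # optThresh
    }

Steps 5 and 6 (applying optThresh to the held-out 20% and computing the F1 score there) follow the same predict-then-cutoff pattern shown earlier.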

Then, if we run this method 100 times, we can measure our maximum model accuracy when using cross-validation.
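
A sketch of that repetition, reusing the hypothetical cross_validation function and the trainval/test split from above:

    run_once <- function() {
      thr   <- cross_validation(trainval, nbfolds = 10)  # optimal threshold from this run
      fit   <- glm(Survived ~ Pclass + Sex,              # placeholder formula
                   data = trainval, family = binomial)
      probs <- predict(fit, newdata = test, type = "response")
      mean(ifelse(probs > thr, 1, 0) == test$Survived)   # accuracy on the 20% test set
    }

    accuracies <- replicate(100, run_once())
    max(accuracies)                                      # best accuracy over the 100 runs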

To read the whole article, with source code and examples, click here.

Credit: Data Science Central By: Andrea Manero-Bastin
