Tuesday, March 2, 2021
  • Setup menu at Appearance » Menus and assign menu to Top Bar Navigation
Advertisement
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News
No Result
View All Result
NikolaNews
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News
No Result
View All Result
NikolaNews
No Result
View All Result
Home Data Science

Speed up your R Work

February 4, 2020
in Data Science
Speed up your R Work
586
SHARES
3.3k
VIEWS
Share on FacebookShare on Twitter

This article was written by John Mount.

 

You might also like

Why Cloud Data Discovery Matters for Your Business

DSC Weekly Digest 01 March 2021

Companies in the Global Data Science Platforms Resorting to Product Innovation to Stay Ahead in the Game

Introduction

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tablesto partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.

Keep in mind: unless the pipeline steps have non-trivial cost, the overhead of partitioning and distributing the work may overwhelm any parallel speedup. Also data.table itself already seems to exploit some thread-level parallelism (notice user time is greater than elapsed time). That being said, in this note we will demonstrate a synthetic example where computation is expensive due to a blow-up in an intermediate join step.

 

Conclusions

The benchmark timings show parallelized data.table is the fastest, followed by parallelized dplyr, and parallelized rqdatatable. In the non-paraellized case data.table is the fastest, followed by rqdatatable, and then dplyr.

A reason dplyr sees greater speedup relative to its own non-parallel implementation (yet does not beat data.table) is that data.table starts already multi-threaded, so data.table is exploiting some parallelism even before we added the process level parallelism (and hence sees less of a speed up, though it is fastest).

rquery pipelines exhibit superior performance on big data systems (Spark, PostgreSQL, Amazon Redshift, and hopefully soon Google bigquery), and rqdatatable supplies a very good in-memory implementation of the rquery system based on data.table. rquery also speeds up solution development by supplying higher order operators and early debugging features.

In this note we have demonstrated simple procedures to reliably parallelize any of rqdatatable, data.table, or dplyr.

To read the rest of the article, with an example and source code, click here. This material is also available on GitHub, here. 

 

 


Credit: Data Science Central By: Andrea Manero-Bastin

Previous Post

ESG investing – the alpha is in the alternative data

Next Post

Sudo Bug Lets Non-Privileged Linux and macOS Users Run Commands as Root

Related Posts

Why Cloud Data Discovery Matters for Your Business
Data Science

Why Cloud Data Discovery Matters for Your Business

March 2, 2021
DSC Weekly Digest 01 March 2021
Data Science

DSC Weekly Digest 01 March 2021

March 2, 2021
Companies in the Global Data Science Platforms Resorting to Product Innovation to Stay Ahead in the Game
Data Science

Companies in the Global Data Science Platforms Resorting to Product Innovation to Stay Ahead in the Game

March 2, 2021
Importance of Data Science in Modern Age
Data Science

Importance of Data Science in Modern Age

March 2, 2021
Tweaking Algorithmic Filtering to Combat Fake News
Data Science

Tweaking Algorithmic Filtering to Combat Fake News

March 2, 2021
Next Post
Sudo Bug Lets Non-Privileged Linux and macOS Users Run Commands as Root

Sudo Bug Lets Non-Privileged Linux and macOS Users Run Commands as Root

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recommended

Plasticity in Deep Learning: Dynamic Adaptations for AI Self-Driving Cars

Plasticity in Deep Learning: Dynamic Adaptations for AI Self-Driving Cars

January 6, 2019
Microsoft, Google Use Artificial Intelligence to Fight Hackers

Microsoft, Google Use Artificial Intelligence to Fight Hackers

January 6, 2019

Categories

  • Artificial Intelligence
  • Big Data
  • Blockchain
  • Crypto News
  • Data Science
  • Digital Marketing
  • Internet Privacy
  • Internet Security
  • Learn to Code
  • Machine Learning
  • Marketing Technology
  • Neural Networks
  • Technology Companies

Don't miss it

Google addresses customer data protection, security in Workspace
Internet Security

Google addresses customer data protection, security in Workspace

March 2, 2021
New ‘unc0ver’ Tool Can Jailbreak All iPhone Models Running iOS 11.0
Internet Privacy

New ‘unc0ver’ Tool Can Jailbreak All iPhone Models Running iOS 11.0

March 2, 2021
Why Cloud Data Discovery Matters for Your Business
Data Science

Why Cloud Data Discovery Matters for Your Business

March 2, 2021
Opportunity, Trends, Share, Top Companies Analysis (Based on 2021 COVID-19 Worldwide Spread) – NeighborWebSJ
Machine Learning

Opportunity, Trends, Share, Top Companies Analysis (Based on 2021 COVID-19 Worldwide Spread) – NeighborWebSJ

March 2, 2021
Australia’s new ‘hacking’ powers considered too wide-ranging and coercive by OAIC
Internet Security

Australia’s new ‘hacking’ powers considered too wide-ranging and coercive by OAIC

March 2, 2021
DSC Weekly Digest 01 March 2021
Data Science

DSC Weekly Digest 01 March 2021

March 2, 2021
NikolaNews

NikolaNews.com is an online News Portal which aims to share news about blockchain, AI, Big Data, and Data Privacy and more!

What’s New Here?

  • Google addresses customer data protection, security in Workspace March 2, 2021
  • New ‘unc0ver’ Tool Can Jailbreak All iPhone Models Running iOS 11.0 March 2, 2021
  • Why Cloud Data Discovery Matters for Your Business March 2, 2021
  • Opportunity, Trends, Share, Top Companies Analysis (Based on 2021 COVID-19 Worldwide Spread) – NeighborWebSJ March 2, 2021

Subscribe to get more!

© 2019 NikolaNews.com - Global Tech Updates

No Result
View All Result
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News

© 2019 NikolaNews.com - Global Tech Updates