Thursday, February 25, 2021
NikolaNews

AWK — a Blast from Wrangling Past.

September 22, 2019
in Data Science

I recently came across an interesting account by a practical data scientist on how to munge 25 TB of data. What caught my eye at first was the article’s title: “Using AWK and R to parse 25tb”. I’m a big R user now and made a living with AWK 30 years ago as a budding data analyst. I also empathized with the author’s account of his painful but steady education in working with that volume of data: “I didn’t fail a thousand times, I just discovered a thousand ways not to parse lots of data into an easily query-able format.” Been there, done that.

After reading the article, I was intrigued with AWK again after all these years. A Unix-based munging predecessor of perl and python, AWK is particularly adept at working with delimited text files, automatically splitting each record into fields identified as $1, $2, etc. My use of AWK generally revolved around selecting columns (projecting) and rows (filtering) from text files, then piping the results to other scripts for additional processing. I found that AWK did these simple tasks very well but didn’t scale to more demanding data programming; I remember well that trouble lurked whenever I attempted to contort AWK into doing something it wasn’t intended to do. Indeed, I pretty much abandoned AWK when the more comprehensive perl emerged in the late 80s. In retrospect, I’m not sure that was the best course. The optimal approach might have been to continue using AWK for the simpler project and filter work on files, saving perl (and then python) for more complex tasks.
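
To make the project/filter idea concrete, here is a minimal Python sketch of what an AWK one-liner does with a delimited file: split each record into fields, keep the rows that pass a test, and print only selected columns. The sample data, the column layout, and the function name `project_filter` are all hypothetical and are not taken from the original blog.

```python
import io


def project_filter(lines, keep, predicate, sep=","):
    """Yield the selected fields of every record that satisfies predicate.

    keep uses AWK-style 1-based field numbers, so keep=(1, 3) mirrors $1 and $3.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if predicate(fields):
            yield sep.join(fields[i - 1] for i in keep)


# Hypothetical sample data standing in for a delimited text file.
sample = io.StringIO("id,state,income\n1,MN,42000\n2,WI,61000\n3,MN,75000\n")
next(sample)  # skip the header record

# Roughly the same job as: awk -F, 'NR > 1 && $3 > 50000 {print $1 "," $3}'
rows = list(project_filter(sample, keep=(1, 3),
                           predicate=lambda f: int(f[2]) > 50000))
print(rows)  # ['2,61000', '3,75000']
```

The AWK version is a single line, which is the appeal for this kind of work; the Python version only spells out the steps AWK performs implicitly.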

So I just had to reacquaint myself with AWK, and downloaded the GNU version, gawk. I then devised several quick tasks on a pretty large data source to test the language. The data for the analyses consist of 4 large files of census information totaling over 14 GB which, in sum, comprise 15.8M records and 286 attributes. I use AWK to project/filter the input data, and then pipe the results to python or R for analytic processing. AWK does some pretty heavy, albeit simple, processing. In my tests, both R and python/pandas could have handled AWK’s tasks as well, but it’s not hard to imagine a pipeline that requires projecting and filtering up front.
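
As a rough illustration of the downstream half of such a pipeline, the sketch below shows a Python function that consumes already projected and filtered text and tallies records by one column. It is an assumption-laden stand-in, not the scripts from the original blog: the function name `count_by_field`, the two-column layout, and the sample values are hypothetical, and an in-memory string takes the place of the pipe so the example runs on its own.

```python
import csv
import io
from collections import Counter


def count_by_field(stream, field=0, sep=","):
    """Count records per distinct value of one column in delimited text."""
    counts = Counter()
    for row in csv.reader(stream, delimiter=sep):
        if row:  # ignore blank lines
            counts[row[field]] += 1
    return counts


# Stand-in for text arriving on a pipe from an upstream AWK step.
# In a real pipeline the stream would be sys.stdin instead.
piped = io.StringIO("MN,42000\nWI,61000\nMN,75000\n")
counts = count_by_field(piped)
print(counts.most_common())  # [('MN', 2), ('WI', 1)]
```

Because the function reads its input one record at a time, the upstream step can trim a multi-gigabyte file down to the few columns and rows of interest before anything reaches Python.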

Unlike other blogs I’ve written using Jupyter Notebook, this one does not execute in a python or R kernel; rather the notebook simply displays the AWK, python, and R scripts and their outputs.

The technology used below is Windows 10, JupyterLab 0.35.4, Anaconda Python 3.7.3, Pandas 0.24.2, R 3.6.0, Cygwin 3.0.7, and GNU Awk (gawk) 5.0.1. All gawk, python, and R scripts are simply components in pipelines launched from bash shell command lines in Cygwin windows.

Read the entire blog here.


Credit: Data Science Central. By: Steve Miller

© 2019 NikolaNews.com - Global Tech Updates
