Sunday, April 11, 2021
  • Setup menu at Appearance » Menus and assign menu to Top Bar Navigation
Advertisement
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News
No Result
View All Result
NikolaNews
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News
No Result
View All Result
NikolaNews
No Result
View All Result
Home Big Data

Apache Arrow: The little data accelerator that could

February 25, 2019
in Big Data
Apache Arrow: The little data accelerator that could
614
SHARES
3.4k
VIEWS
Share on FacebookShare on Twitter

Credit: ZDnet

A few years back, we noted the emergence of Apache Arrow; what piqued our attention was that the backers consisted of “a who’s who list” of over 20 committers from the likes of Cloudera, MapR, Hortonworks, Salesforce.com, DataStax, Twitter, AWS, and Dremio.

You might also like

Weaviate is an open-source search engine powered by ML, vectors, graphs, and GraphQL

MinIO simplifies onramps to do-it-yourself hybrid cloud object storage

Trifacta goes all in on the cloud

As we characterized it then, Arrow was about big data, almost literally, lining its duck up in a… column. Arrow is a standard columnar format for persisting data efficiently in memory. You’d think that in-memory compute would simply brute force performance, which was one of the original draws of Spark. But memory isn’t just a fast black box. There’s a trick to loading data so it can be read efficiently; that’s why developers often ran out of memory.

The challenge with balancing, in this case the loading of data into memory and the feeding of data from memory to compute is the symptom of a larger age-old problem that we used to refer to as balance of system. For years, Teradata sold different data warehousing appliances designed for data-intensive, compute-intensive, or mixed workload scenarios. Just because we’ve either brought compute to data, as in Hadoop clusters, or separated data from compute, as in the cloud, the challenge remains the same. In fact, it grows even more complicated when you have problems that really take resource requirements to the mix: deep learning AI problems that crunch lots of data while using special GPU hardware for compute.

Apache Arrow was conceived to solve a balance of system problem for data scientists: making sure that they didn’t run out of memory when running their models or run out of budget because they overallocated memory. This wasn’t an imaginary problem. For Wes McKinney, the Python and serial open source developer who created the Pandas data analysis library for Python (among other projects), it was the frustration of hitting that memory wall. He learned the hard way that, if you had a large data set, you’d better allocate at least 5 – 10x the amount of RAM to it. There needed to be a better way to reduce the trial, error, cost, and complexity of moving data into memory and accessing it.

McKinney, who is currently director of Ursa Labs, and Jacques Nadeau, CTO of Dremio, helped launch Arrow, which became an Apache project back in 2016. Arrow got supported by a critical mass of big data open source projects from the get-go: the Hadoop ecosystem, Spark, Storm, Cassandra, Pandas, HBase, and Kudu. It can read data from popular storage formats such as Apache Parquet, CSV files, Apache ORC, and JSON. Its initial support for Python has grown to 11 programming languages, among them C++, Java, Python, R, C#, JavaScript, and Ruby. When deployed, for instance with Spark, it has been benchmarked as improving performance up to 25x.

Given the wide support, the Apache project page listing a sampling of products and projects using Arrow is a bit underwhelming, as few of them are household names. Examples include Fletcher, a framework for converting an Arrow schema to work with FPGAs; Graphistry, a visual investigation platform used for security, anti-fraud, and related investigations; and Ray, a high-performance distributed execution framework designed for machine learning and AI applications. But where there’s smoke, there’s fire; download rates from the project portal are averaging about 1 million monthly. The community remains active; over the past year nearly 300 individuals have submitted more than 3000 contributions.

So where is Arrow pointing from here? The most exciting project involves its role as the foundation for cuDF, the DataFrame foundation library for RAPIDS that is built around Arrow. There is Gandiva, the emerging SQL execution kernel for Arrow developed by Dremio that is based on the LLVM open source compiler. Another initiative is around transport so data marshaled on one Arrow node can be efficiently replicated or moved to another.

But the missing piece is streaming, where the velocity of incoming data poses a special challenge. There are some early experiments to populate Arrow nodes in microbatches from Kafka. And, as the edge gets smarter (especially as machine learning is applied), it will also make sense for Arrow to emerge in a small footprint version, and with it, harvesting some of the work around transport for feeding filtered or aggregated data up to the cloud.

It’s not time for the Arrow folks to pack up and go home yet.

Credit: ZDnet

Previous Post

Monitoring Customer Data With Machine Learning

Next Post

And Monty Hall Went Bayesian...

Related Posts

Weaviate is an open-source search engine powered by ML, vectors, graphs, and GraphQL
Big Data

Weaviate is an open-source search engine powered by ML, vectors, graphs, and GraphQL

April 8, 2021
MinIO simplifies onramps to do-it-yourself hybrid cloud object storage
Big Data

MinIO simplifies onramps to do-it-yourself hybrid cloud object storage

April 7, 2021
Trifacta goes all in on the cloud
Big Data

Trifacta goes all in on the cloud

April 6, 2021
Cloudera Data Platform hits Google Cloud
Big Data

Cloudera Data Platform hits Google Cloud

March 31, 2021
Cloudera fills gap in streaming platform with SQL
Big Data

Cloudera fills gap in streaming platform with SQL

March 31, 2021
Next Post
And Monty Hall Went Bayesian…

And Monty Hall Went Bayesian...

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recommended

Plasticity in Deep Learning: Dynamic Adaptations for AI Self-Driving Cars

Plasticity in Deep Learning: Dynamic Adaptations for AI Self-Driving Cars

January 6, 2019
Microsoft, Google Use Artificial Intelligence to Fight Hackers

Microsoft, Google Use Artificial Intelligence to Fight Hackers

January 6, 2019

Categories

  • Artificial Intelligence
  • Big Data
  • Blockchain
  • Crypto News
  • Data Science
  • Digital Marketing
  • Internet Privacy
  • Internet Security
  • Learn to Code
  • Machine Learning
  • Marketing Technology
  • Neural Networks
  • Technology Companies

Don't miss it

Job Scope For MSBI In 2021
Data Science

Job Scope For MSBI In 2021

April 11, 2021
Basic laws of physics spruce up machine learning
Machine Learning

New machine learning method accurately predicts battery state of health

April 11, 2021
Can a Machine Learning Model Predict T2D?
Machine Learning

Can a Machine Learning Model Predict T2D?

April 11, 2021
Leveraging SAP’s Enterprise Data Management tools to enable ML/AI success
Data Science

Leveraging SAP’s Enterprise Data Management tools to enable ML/AI success

April 11, 2021
Machine Learning in Finance Market is exclusively demanding in forecast 2029 | Ignite Ltd, Yodlee, Trill A.I., MindTitan, Accenture, ZestFinance – KSU
Machine Learning

Machine Learning in Finance Market is exclusively demanding in forecast 2029 | Ignite Ltd, Yodlee, Trill A.I., MindTitan, Accenture, ZestFinance – KSU

April 10, 2021
Vue.js vs AngularJS Development in 2021: Side-by-Side Comparison
Data Science

Vue.js vs AngularJS Development in 2021: Side-by-Side Comparison

April 10, 2021
NikolaNews

NikolaNews.com is an online News Portal which aims to share news about blockchain, AI, Big Data, and Data Privacy and more!

What’s New Here?

  • Job Scope For MSBI In 2021 April 11, 2021
  • New machine learning method accurately predicts battery state of health April 11, 2021
  • Can a Machine Learning Model Predict T2D? April 11, 2021
  • Leveraging SAP’s Enterprise Data Management tools to enable ML/AI success April 11, 2021

Subscribe to get more!

© 2019 NikolaNews.com - Global Tech Updates

No Result
View All Result
  • AI Development
    • Artificial Intelligence
    • Machine Learning
    • Neural Networks
    • Learn to Code
  • Data
    • Blockchain
    • Big Data
    • Data Science
  • IT Security
    • Internet Privacy
    • Internet Security
  • Marketing
    • Digital Marketing
    • Marketing Technology
  • Technology Companies
  • Crypto News

© 2019 NikolaNews.com - Global Tech Updates