Alluxio, the company whose data orchestration layer is based on the open source in-memory file system project originally known as Tachyon, announced version 2.0 of its product last week at the AWS Summit in New York City. Alluxio 2.0 delivers a range of new features, including integrations with Amazon Web Services’ Elastic MapReduce (EMR) service. This post covers these new 2.0 features, as well as Alluxio’s interesting underpinnings. The conceptual material comes first, so skip to the “New, in v2” section, below, for the news.
A fine mess they’ve gotten us into
The modern open source data stack is a disaggregated, loosely federated collection of open source projects, paired with some commercial offerings. A perhaps inconvenient truth is that this phenomenon has resulted in a stack that contains a great number of data silos. Adding to the challenge, the growing use of cloud object storage for analytics and data lakes slows data access down. Caching data in memory can help, but it’s not a panacea, as each compute framework tends to do so in its own way, which only exacerbates the silos issue.
On the other hand, most data frameworks know how to access file systems, including the Hadoop Distributed File System (HDFS) on-premises, and Amazon Simple Storage Service (S3), Azure Blob/Data Lake Storage (WASB/ADLS) and Google Cloud Storage (GCS) in the cloud. As such, implementing an in-memory cache accessible via common file system APIs seems like a good way to unify the otherwise fragmented ecosystem, in a way that accelerates data access and enables sharing of data between frameworks (which is what the data lake construct is all about).
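To make the idea concrete, here is a minimal Python sketch of a read-through, in-memory cache that sits in front of a slower store and is shared by every caller. It illustrates the general pattern only and is not Alluxio's implementation or API; the class names SlowStore and CachingFileSystem, and the file path, are hypothetical.

```python
class SlowStore:
    """Hypothetical stand-in for a remote store such as an object store."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0  # counts how often the slow backend is actually hit

    def read(self, path):
        self.reads += 1
        return self._objects[path]


class CachingFileSystem:
    """Read-through in-memory cache exposing a file-system-style read()."""

    def __init__(self, under_store):
        self._under_store = under_store
        self._cache = {}  # path -> bytes held in memory

    def read(self, path):
        # On a miss, fetch from the slow store once and keep the bytes.
        if path not in self._cache:
            self._cache[path] = self._under_store.read(path)
        return self._cache[path]


store = SlowStore({"/lake/sales.csv": b"id,amount\n1,9.50\n"})
fs = CachingFileSystem(store)

# Two different "frameworks" read the same path through the shared layer.
first = fs.read("/lake/sales.csv")
second = fs.read("/lake/sales.csv")
```

Because both reads go through the same layer, the slow store is touched only once, and the second caller is served from memory. That is the sharing benefit a per-framework cache cannot provide.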
That’s where the Tachyon/Alluxio project comes in. The project was incubated at UC Berkeley’s AMPLab, the same organization that incubated what is now Apache Spark. Haoyuan (H.Y.) Li, the then Ph.D. student behind the project, founded Alluxio (originally Tachyon Nexus) and currently serves as its CTO. According to Crunchbase, the company has received $16M in funding through two rounds, the most recent being its $8.5M Series B in January, 2019.
You can think of Alluxio, which is available in both Community and Enterprise editions, as an in-memory cache. A data virtualization gateway would be another apt categorization. You can think of Alluxio as a file system too — which aligns it with the data lake construct of standalone data sets stored in file formats like CSV and Parquet. And, for folks from a relational database background, Alluxio says you can think of it as the heir to an RDBMS buffer pool. No matter how you think of it, though, it supports HDFS, S3, POSIX and Java file system interfaces, optimized for clients including Spark, Presto and Hive.
In general, then, compute frameworks use Alluxio as an in-memory file system cache abstraction over data, accelerating data access performance and simplifying connections to the data itself. And though Alluxio can be acquired and implemented in a standalone manner, it’s now also available in OEM form. Alluxio announced last month that the product is now available from Starburst, integrated with that company’s commercial Presto distribution, such that Alluxio and Presto worker nodes are co-located, optimizing data locality and accelerating performance overall.
New, in v2
Version 2 of Alluxio, which has been released to general availability (GA), sports a range of AWS-specific integrations. To begin with, the product is available for evaluation and deployment in the form of an Amazon Machine Image (AMI). That’s a nice way to get started, but perhaps even better, Alluxio can be deployed to an EMR cluster. This is done via an EMR bootstrap action, allowing Alluxio to be installed on the EMR cluster when it is first provisioned.
Outside specific vendor ecosystems, Alluxio has now added REST-based services to its list of supported data sources. When combined with the product’s support for the TensorFlow deep learning framework, this makes for interesting AI implementations, including building models on data resident on Web sites like Kaggle, the Google-owned data science site, and data.gov, the United States Government’s open data portal.
V2 also adds policy-driven features to support data tiering, allowing “hot,” “warm” and “cold” data to reside in-memory, on solid state drive (SSD) media or spinning hard disk drive (HDD) infrastructure, respectively. While that’s all very nice for on-premises work, v2 also adds a data service that facilitates movement of data across different public cloud storage layers.
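As a rough illustration of what policy-driven tiering means, the Python sketch below assigns each file to memory, SSD or HDD based on how often it has been read recently. The thresholds, the choose_tier function and the file paths are all hypothetical, chosen only for this example; Alluxio's actual policy mechanism and configuration are not shown here.

```python
from enum import Enum


class Tier(Enum):
    MEM = "memory"
    SSD = "ssd"
    HDD = "hdd"


def choose_tier(reads_last_day, hot_threshold=100, warm_threshold=10):
    """Map an access count to a storage tier (thresholds are assumptions)."""
    if reads_last_day >= hot_threshold:
        return Tier.MEM  # "hot" data stays in memory
    if reads_last_day >= warm_threshold:
        return Tier.SSD  # "warm" data goes to solid state drives
    return Tier.HDD      # "cold" data goes to spinning disk


# Hypothetical access counts for three files in a data lake.
access_counts = {
    "/lake/clickstream_today.parquet": 500,
    "/lake/sales_last_month.parquet": 40,
    "/lake/archive_2012.parquet": 0,
}

placement = {path: choose_tier(count) for path, count in access_counts.items()}
```

A real policy would likely weigh more than read counts, such as file age or available capacity per tier, but the shape is the same: a rule evaluates each file and decides where it should live.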
Other features, including cluster partitioning, adaptive replication and a high-availability mode called Embedded Journal, along with integration of RocksDB for tiering metadata storage and gRPC for intra-cluster communication, round out the 2.0 release.
When Tachyon emerged, an in-memory file system seemed like a cool idea and something that would be generally useful. At the time, its apparent utility was mostly an intuitive judgement. But with the movement of data lakes to and across public clouds, along with the continued proliferation of data compute frameworks and query engines, the need for Alluxio seems much more concrete.
Yes, the open source and startup data worlds have delivered innovative technology in response to the hegemony of incumbent enterprise data warehouse and BI platforms. But in doing so, they lost sight of the value in the integration and optimization these single-vendor platforms provided. The result has been an absurd proliferation of data silos. Thankfully, some players, including Alluxio, are trying to address and mitigate the modern data stack’s complexity. It’s beyond time we dispelled the purist notion that decoupled-everything is the way to go. Platforms like Alluxio seek to give us back the cohesion the industry foolishly neglected and rejected.