At its Digital Summit virtual event today, real-time NoSQL database player Aerospike announced a new release of its eponymous product. The v5.6 release adds three features that together are designed to optimize the loop of real-time data processing and machine learning (ML) at the edge and the “core” (cloud or corporate data center). The target scenarios involve training ML models at the core on data collected at the edge, then pushing the models back to the edge for inferencing.
Three legs of the stool
ZDNet spoke with Aerospike founder and Chief Product Officer Srini Srinivasan, who briefed us on the three features that facilitate and optimize this virtuous data/ML cycle. They are:
- Set indexes: these accelerate access to data in Aerospike sets (comparable to tables). The company says this feature keeps queries against a set fast, even in a petabyte-scale database.
- Enhancements to Aerospike expressions: read and write operations can now be embedded within expressions. Srinivasan explained that expressions, which are implemented in C, execute much more efficiently than user-defined functions and move processing closer to the data.
- Updated Aerospike Connect for Spark: this connector is now compatible with Apache Spark 3.0. This, in turn, allows developers to use Spark 3.0 and its APIs directly against Aerospike, bringing back data as Spark DataFrames.
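To make the first item more concrete, here is a toy Python model of what a per-set index buys. It is not Aerospike's implementation and calls no Aerospike API; `ToyNamespace`, its methods and `build_demo_namespace` are hypothetical names invented for illustration. The working assumption is that, without a set index, finding one set's records means walking every record in the namespace, while a set index lets the lookup touch only the records that belong to that set.

```python
from collections import defaultdict


class ToyNamespace:
    """Toy in-memory stand-in for a namespace whose records are grouped into sets."""

    def __init__(self):
        self.records = {}                   # primary key -> (set name, bins)
        self.set_index = defaultdict(set)   # set name -> primary keys in that set

    def put(self, key, set_name, bins):
        # Keep the set index consistent if a key is rewritten into another set.
        previous = self.records.get(key)
        if previous is not None:
            self.set_index[previous[0]].discard(key)
        self.records[key] = (set_name, bins)
        self.set_index[set_name].add(key)

    def scan_without_index(self, set_name):
        """Walk every record and keep those in set_name.

        Returns (matching bins, number of records examined).
        """
        examined = 0
        matches = []
        for stored_set, bins in self.records.values():
            examined += 1
            if stored_set == set_name:
                matches.append(bins)
        return matches, examined

    def scan_with_index(self, set_name):
        """Use the per-set index so only that set's records are examined."""
        keys = self.set_index.get(set_name, set())
        matches = [self.records[key][1] for key in keys]
        return matches, len(keys)


def build_demo_namespace():
    """Hypothetical data: one large set and one small set in the same namespace."""
    ns = ToyNamespace()
    for i in range(1000):
        ns.put(("clicks", i), "clicks", {"user": i})
    for i in range(5):
        ns.put(("models", i), "models", {"version": i})
    return ns
```

In this toy, fetching the five `models` records examines 1,005 records without the index and 5 with it. The gap widens as the rest of the namespace grows, which is the intuition behind the company's petabyte-scale claim.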
Lighting up Spark
The Aerospike connector for Spark allows real-time and historical data in the database to be used for training ML models, without requiring that data to be exported first. Also, explained Srinivasan, Aerospike can manage data sets larger than what fits in memory. That lets the otherwise memory-oriented Spark work with high-volume data, potentially much faster than it could against, say, Parquet files in cloud storage.
The more Spark can “push down” data operations to Aerospike, the better, and the Aerospike connector aggressively delegates work to the database when Spark code queries it. Such operations then benefit further from the set indexes and expression enhancements also introduced in the new release. Aerospike’s connector for Presto (and, one would assume, Trino) operates similarly and benefits Presto users in a comparable fashion.
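The pushdown idea can be sketched in a few lines of plain Python. This is a conceptual toy, not the Aerospike connector or the Spark API: `ToyStore`, `query_without_pushdown`, `query_with_pushdown` and the sample sensor rows are all hypothetical. The only point it illustrates is that when the filter runs inside the store, far fewer rows have to travel to the compute engine.

```python
class ToyStore:
    """Toy data store that counts how many rows it ships to the caller."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.rows_shipped = 0

    def read(self, predicate=None):
        """Return rows; when a predicate is given, filter store-side first."""
        if predicate is None:
            selected = list(self._rows)
        else:
            selected = [row for row in self._rows if predicate(row)]
        self.rows_shipped += len(selected)
        return selected


def query_without_pushdown(store, predicate):
    """The engine pulls every row, then filters on its own side."""
    return [row for row in store.read() if predicate(row)]


def query_with_pushdown(store, predicate):
    """The engine hands the filter to the store and receives only matches."""
    return store.read(predicate)


def make_sensor_rows(count=1000):
    """Hypothetical edge readings: temperatures cycle through 0..99."""
    return [{"device": i, "temp": i % 100} for i in range(count)]


def is_hot(row):
    """Example filter: keep readings at or above 95."""
    return row["temp"] >= 95
```

With the 1,000 sample rows, 50 satisfy `is_hot`. `query_without_pushdown` causes the store to ship all 1,000 rows, while `query_with_pushdown` ships only the 50 matches, and both return the same result.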
(Data)frame of reference
This pattern of allowing Spark developers to work natively against external databases is gaining momentum. Other databases, like Splice Machine, have enabled similar interfaces. Spark is now such a standard that its DataFrames are becoming a developer’s universal abstraction layer over data for the purposes of stream processing, querying, data engineering and ML.
Given the huge array of database and analytics platforms that have emerged over the last decade, it’s good to see that one of them, Spark, is becoming a tool of consensus for working with several of the others. It’s also good to see that Aerospike now enables this for Spark 3.0.