A key attraction of the cloud for data management is that abundant, scale-out storage provides a great opportunity to consolidate data and break down data silos. When it comes to analytics, the cloud also provides an opportunity to break down the application and functional silos that separate data warehouses, which store and process data, from the tools used for ingesting, transforming, and visualizing that data.
Over the past few months, we’ve seen examples from Microsoft and SAP, which have blended services such as data pipelining, transformation, integration with cloud object storage, and self-service analytics into the latest editions of their cloud data warehousing services. Last week at re:Invent, it was Amazon’s turn to step up to the plate.
Amazon announced a series of enhancements and updates to Amazon Redshift that at first glance might appear overlapping and confusing – and some of Amazon’s explanations don’t necessarily add clarity. But on closer study, these new features provide different paths to bringing the data warehouse and data lake together. They include a new instance type that handles large volumes of data more economically; new options for federated query; and new hardware acceleration that speeds query performance.
Introducing a new Redshift compute instance
Let’s start with the new instance, RA3. It moves cooler data off active compute nodes to tiered storage that includes S3, using AWS’s Nitro hypervisor to speed that movement. The new nodes, which are now generally available, support up to 8 PBytes of compressed data in Redshift “managed storage.” We’ll explain that term in a moment. The new RA3 instances are targeted at operational analytics workloads that frequently use a subset of the data (otherwise known as “hot” data), but also need access to the full data set.
Here’s how RA3 works. AWS characterizes RA3 as separating compute from storage. While this topology is often associated with elastic computing (where compute instances can be completely turned off when not in use), it is used differently here. The RA3 node stays active, but cooler data can be moved off the Redshift compute cluster to a managed instance of S3 storage. We think the term “managed storage” is confusing because it implies a new storage tier; in fact, it simply refers to the combination of S3 and cached storage (which resides on the cluster) that RA3 automatically manages. Consider it a form of hot and cold data tiering. For the customer, the benefit of RA3 is, first, that you don’t have to pay for a supersized instance that accommodates all those petabytes of data; and second, that it provides yet another path for bridging the Redshift data warehouse to the data lake.
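To make the hot/cold tiering idea concrete, here is a minimal conceptual sketch. This is purely an illustration of the principle, not how Redshift managed storage is actually implemented; the block structure, the seven-day window, and all names are our own assumptions.

```python
import time

# Conceptual sketch of hot/cold data tiering (illustration only; this is
# NOT AWS's implementation, and the 7-day window is an assumed policy).
HOT_WINDOW_SECONDS = 7 * 24 * 3600

def classify_blocks(blocks, now=None):
    """Partition data blocks into a hot tier (local cache on the cluster)
    and a cold tier (object storage such as S3), by last access time."""
    now = now if now is not None else time.time()
    hot, cold = [], []
    for block in blocks:
        if now - block["last_access"] <= HOT_WINDOW_SECONDS:
            hot.append(block["id"])
        else:
            cold.append(block["id"])
    return hot, cold

blocks = [
    {"id": "b1", "last_access": time.time()},                   # touched just now
    {"id": "b2", "last_access": time.time() - 30 * 24 * 3600},  # a month ago
]
hot, cold = classify_blocks(blocks)
```

The point of the sketch is simply that the tiering decision is automatic and policy-driven; the user queries one logical table while the system decides which blocks live in cache and which live in S3.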
A new AQUA hue
AWS is also introducing a new distributed, custom hardware-accelerated caching and compute layer that could speed query performance. It could complement the new RA3 instance, which is designed for handling large volumes of data, or work with any other Redshift compute instance.
The new hardware, called the Advanced Query Accelerator (AQUA) for Amazon Redshift, is now in private preview. AQUA tackles a “balance of system” problem that arises when processing distributed data. The challenge, as Amazon describes it, is that while SSD bandwidth has increased by 12x since 2012, streaming CPU bandwidth has only doubled because of bottlenecks at the internal bus connecting memory and CPU.
AQUA sits inside Redshift’s storage tier, offloading common operations such as encryption, compression, and filtering and aggregation functions that would otherwise require high network bandwidth and bog down the compute cluster. It extends Amazon’s Nitro hypervisor chip, which, by the way, employs the same principle of offloading tasks such as networking, storage (via NVMe to EBS), security, management, and monitoring functions that would otherwise tie up the CPU. The bottom line is that when Redshift is implemented with the new RA3 instance, AQUA, and Nitro, AWS claims that performance will accelerate by up to 10x.
Another new performance-related feature is a preview of support for materialized views, which store precomputed results for commonly used queries and are updated incrementally. Materialized views are a common feature in other data warehouses; in Redshift, they can be created from one or more source tables using filters, projections, inner joins, aggregations, grouping, functions, and other SQL constructs.
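To illustrate, here is a sketch of what the materialized view DDL looks like, held as SQL text in Python (the table and column names are made up for illustration, and in practice the statements would be executed over a live Redshift connection via a PostgreSQL-protocol driver):

```python
# Hedged sketch of Redshift materialized view DDL; "sales", "daily_sales",
# and the columns are hypothetical names, not from the announcement.
create_mv = """
CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date, SUM(amount) AS total_amount
FROM sales
GROUP BY order_date;
"""

# An incremental refresh picks up changes to the base table since
# the last refresh, rather than recomputing the view from scratch.
refresh_mv = "REFRESH MATERIALIZED VIEW daily_sales;"
```

Queries against `daily_sales` then read the precomputed aggregates instead of re-scanning and re-aggregating the base table, which is where the performance benefit comes from.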
Did we mention data lakes?
Amazon is introducing several other new features for Redshift. The first expands Amazon Redshift Spectrum with a new federated query capability: until now, Redshift only supported external queries on data in S3; federated query extends that reach to data sitting in other Amazon databases, specifically Amazon RDS for PostgreSQL and the PostgreSQL-compatible edition of Aurora. (We expect that other RDS databases will be supported in the future, and we wonder whether other targets like DynamoDB or the new MCS offering might also be added.) Secondly, a new Data Lake Export capability can move data from a Redshift cluster to S3 in Parquet format. And data types are getting more extensible with the addition of support for geospatial data. This is a data type that is already directly supported in some relational database platforms, such as Oracle, Teradata, and Google BigQuery; supported through partner links in others (e.g., SAP HANA with Esri); but not yet in Microsoft Azure Synapse Analytics.
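Here is a sketch of what the two new capabilities look like in SQL, again held as text in Python. The bucket, cluster endpoint, IAM role, and secret ARN are all placeholders, not real resources:

```python
# Federated query: expose a PostgreSQL schema in RDS/Aurora to Redshift.
# All identifiers and ARNs below are hypothetical placeholders.
create_federated_schema = """
CREATE EXTERNAL SCHEMA pg_federated
FROM POSTGRES
DATABASE 'appdb' SCHEMA 'public'
URI 'my-aurora-cluster.example.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:pg-creds';
"""

# Data Lake Export: unload query results from the cluster to S3 as Parquet.
# Note the doubled single quotes to escape literals inside UNLOAD's query.
unload_to_parquet = """
UNLOAD ('SELECT * FROM sales WHERE order_date < ''2019-01-01''')
TO 's3://my-data-lake/sales/archive/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
"""
```

Once created, tables under `pg_federated` can be joined directly against local Redshift tables in a single query, while the unloaded Parquet files become queryable by lake-side engines such as Athena or Redshift Spectrum.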
Extending the cloud data warehouse
For data warehousing, cloud-native platforms provide opportunities for bridging data and application silos. And this area has proven a hotbed of activity.
With the latest enhancements, AWS has provided a couple of paths for bridging Redshift to the data lake: the RA3 nodes, which use internal processing optimizations to access data stored in S3, along with the classic federated query (a.k.a. Redshift Spectrum) that projects external tables onto S3. In turn, the new data lake export feature provides a way to keep older data active and accessible to the data warehouse.
As noted earlier, Microsoft and SAP are also extending their cloud data warehousing platforms beyond the traditional analytic database. For Microsoft, Azure Synapse Analytics focuses heavily on integrating the data pipelining and transformation capabilities of Azure Data Factory, makes Spark a co-equal engine with SQL, and extends access to Azure Data Lake Storage Generation 2 (ADLS Gen2). As with Redshift, visualization tools are not built in, but are easily activated with a few clicks. At the other end of the scale, SAP has focused more on merging self-service analytics into its cloud data warehouse offering, SAP Data Warehouse Cloud, by subsuming the capabilities of SAP Analytics Cloud.
The diversity of approaches to extending cloud data warehouses shows that these are still early days in defining what a cloud data warehousing service really is.