For Amazon Web Services (AWS), the key to its data management strategy has been the conviction that you need the right tool for the job. And so, AWS has amassed a portfolio of 15 databases, and over the past few years, rarely did a re:Invent go by without the announcement of some new database. So maybe it’s time to take a breath.
Last week, ZDNet colleagues Larry Dignan and Asha Barbaschow spotlighted a theme emerging at this year’s re:Invent: AWS is placing its wagers on data movement between storage, database, and analytics workflows, positioning the strategy as the secret sauce for winning more workloads from legacy players. And yes, in an audacious move, AWS is seeking to grab your SQL Server workloads courtesy of Babelfish for Aurora PostgreSQL. But to us, the highlight was the announcement of AWS Glue Elastic Views, which is entering preview.
It’s a response to rivals like Oracle that emphasize “converged databases,” arguing that splitting workloads into separate data stores erects new silos and adds complexity. While at this point we’re not going to predict that AWS will figuratively close the patent office and stop inventing new databases, there is a need to tie it all together. Elastic Views is the latest stop on AWS’s integration journey, providing a much simpler alternative to what has come before.
AWS’s data integration journey
First, a look at where AWS has come from. AWS is not new to the data and database integration game, but until now those capabilities have had limitations. Many of AWS’s database integration paths carried operational complexity, such as the need to set up configurations to get data or answers flowing, not to mention the need to either manually trigger updates or code them to keep the data current.
It started modestly with AWS Glue, initially designed as an ETL service. Over the years, Glue has added a data catalog, a schema registry, and now, Elastic Views, which we’ll focus on below. Beyond Glue, AWS had other paths for integration between its databases. For instance, several years ago, AWS extended the Amazon Redshift data warehouse with Amazon Redshift Spectrum, a capability that queries S3 cloud storage in massively parallel fashion, aggregates the data, and then sends the results back to the local Redshift cluster to generate the final result. With Spectrum, data in S3 is treated as an external table that can be joined to local Redshift tables; you don’t extend a Redshift table into S3, but you can join to it.
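To make the Spectrum pattern concrete, here is a hedged sketch of the two statements involved: exposing S3 data as an external table, then joining it to a local Redshift table. The schema, table, column, and bucket names are hypothetical; in practice you would run these statements against a Redshift cluster (for example, via the redshift-data API).

```python
# Illustrative only: builds the SQL text for the Redshift Spectrum
# pattern described above. Names (spectrum.orders, public.customers,
# the S3 path) are made up for this sketch.

def spectrum_external_table_ddl(schema: str, table: str, s3_path: str) -> str:
    """Build CREATE EXTERNAL TABLE DDL for Parquet data sitting in S3."""
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    order_id BIGINT,\n"
        "    customer_id BIGINT,\n"
        "    amount DECIMAL(10,2)\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_path}';"
    )

def spectrum_join_query(schema: str, table: str) -> str:
    """Join the external (S3-backed) table to a local Redshift table."""
    return (
        "SELECT c.name, SUM(o.amount) AS total\n"
        f"FROM {schema}.{table} o\n"
        "JOIN public.customers c ON c.customer_id = o.customer_id\n"
        "GROUP BY c.name;"
    )

print(spectrum_external_table_ddl("spectrum", "orders", "s3://my-bucket/orders/"))
print(spectrum_join_query("spectrum", "orders"))
```

The point of the pattern is that the external table never leaves S3; only the join happens inside Redshift.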
If Redshift Spectrum sounds like federated query, Amazon Redshift Federated Query is the real thing. It initially worked only with PostgreSQL (either RDS for PostgreSQL or Aurora PostgreSQL), but support for RDS and Aurora MySQL is being announced today. Here, query processing from Redshift is pushed down to the RDS or Aurora instance, which sends only the results back to the local Redshift cluster. Because Redshift itself is descended from PostgreSQL, commonly used data types are identical, but others, such as JSON, JSONB (binary JSON), arrays, monetary types, non-integer numbers, and XML, must be converted to generic variable character fields. There will be some additional transforms with the new MySQL support.
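The kind of conversion described above can be sketched in a few lines: values of unsupported source types surface as plain text. This is a minimal illustration of the idea, not Redshift’s actual coercion rules.

```python
import json
from datetime import date
from typing import Any

def to_varchar(value: Any) -> str:
    """Coerce a value of an unsupported source type (JSON/JSONB
    documents, arrays, dates, etc.) to plain text, roughly mirroring
    how federated query exposes such columns as generic VARCHAR.
    Illustrative sketch only."""
    if isinstance(value, (dict, list)):   # JSON/JSONB documents and arrays
        return json.dumps(value, sort_keys=True)
    if isinstance(value, date):           # dates fall back to ISO-8601 text
        return value.isoformat()
    return str(value)                     # everything else: stringify

print(to_varchar({"sku": "A-100", "tags": ["sale", "new"]}))
```

The upshot for users: anything beyond the common types arrives as text, and any structure inside it must be re-parsed on the Redshift side.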
There have been interactions between some of the other data services, including a bidirectional open source connector between Amazon EMR and DynamoDB. You can use a customized version of Hive to run operations from EMR on data inside DynamoDB or load data from DynamoDB into EMR. In turn, you can also stream updates from DynamoDB into Amazon Elasticsearch Service through a plugin for Logstash.
And, if you just want to query data in your data lake without setting up a database, there’s Amazon Athena. It uses Presto to run massively parallel queries against S3. The serverless Athena was intended for ad hoc query because, no matter what performance optimizations are applied, querying cloud storage will never be as efficient as querying a database that is either indexed or laid out as a column store using compression and filters to optimize performance. Consider Athena the exploratory query capability: when you decide to operationalize a query, you’ll likely migrate and transform the data to run in Redshift.
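An ad hoc Athena query boils down to a single API call. The sketch below assembles the parameters for Athena’s StartQueryExecution action; the database, table, and results bucket are hypothetical, and the actual call (e.g., `boto3.client("athena").start_query_execution(**params)`) is left out so the snippet stands alone.

```python
# Hedged sketch: build the request for Athena's StartQueryExecution.
# All names here (weblogs, access_logs, the S3 bucket) are made up.

def athena_query_params(sql: str, database: str, output_s3: str) -> dict:
    """Assemble StartQueryExecution parameters for an ad hoc query."""
    return {
        "QueryString": sql,                                  # the Presto SQL to run
        "QueryExecutionContext": {"Database": database},     # Glue catalog database
        "ResultConfiguration": {"OutputLocation": output_s3} # where results land
    }

params = athena_query_params(
    "SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="weblogs",
    output_s3="s3://my-athena-results/",  # hypothetical results bucket
)
print(params["QueryString"])
```

Note that Athena always writes results back to S3, which fits its exploratory role: there is no cluster to manage, but also no index to lean on.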
Glue Elastic Views cuts to the chase
AWS Glue Elastic Views promises a simpler way by relying on a longtime popular data warehousing pattern: building materialized views that are automatically updated. Data warehouses routinely use materialized views to avoid constantly re-running queries that repeat the same joins.
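The pattern can be shown with a toy, in-memory example: instead of re-running an aggregate query on every read, keep the result around and apply each change as it arrives. The names and event shapes below are illustrative, not the Elastic Views wire format.

```python
# Minimal sketch of an automatically maintained materialized view:
# a running SUM(amount) GROUP BY customer_id, updated per change
# event rather than recomputed from scratch.

class RunningTotalsView:
    """Toy materialized view of SUM(amount) grouped by customer."""

    def __init__(self):
        self.totals = {}

    def apply_change(self, customer_id, delta):
        """Incrementally fold one change event into the view."""
        self.totals[customer_id] = self.totals.get(customer_id, 0) + delta

view = RunningTotalsView()
for customer, amount in [("alice", 30), ("bob", 10), ("alice", 5)]:
    view.apply_change(customer, amount)

print(view.totals)  # {'alice': 35, 'bob': 10}
```

Reads against the view are now just lookups; the cost of the join/aggregate is paid once per change instead of once per query.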
AWS Glue Elastic Views is set up to build bridges between relational databases, non-relational databases, object storage, and analytics stores across the AWS portfolio. Initially, it supports Amazon DynamoDB (as a source only, at this point), Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service, with support for Amazon RDS and Amazon Aurora, plus others (including AWS and non-AWS databases), expected to follow.
The process starts with building a SQL query using PartiQL (pronounced “particle”), a SQL-compatible language open sourced by AWS that was originally designed for querying nonrelational data such as logs. AWS has been using PartiQL since it became production-ready last year, and it has shown up in Redshift Spectrum and DynamoDB, for example. While PartiQL flattens nested data types such as JSON, it preserves metadata, so the richness of the hierarchy is retained. Surprisingly, while this is a Glue product, Elastic Views doesn’t use Glue for the ETL part. Instead, it uses the PartiQL query to shape the data, publishing a change-data-capture (CDC) stream from the source and landing it as a materialized view in the target.
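The flatten-but-keep-the-metadata idea can be illustrated in a few lines: nested documents are unnested into flat columns, but the key paths are retained, so the hierarchy is recoverable. This mimics the concept, not PartiQL’s actual internals.

```python
# Illustrative sketch: flatten a nested document into path-keyed
# columns, preserving the hierarchy in the column names.

def flatten(doc, prefix=""):
    """Flatten a nested dict into {dotted.path: value} columns."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse, extending the path
        else:
            flat[path] = value
    return flat

record = {"id": 1, "address": {"city": "Seattle", "geo": {"lat": 47.6}}}
print(flatten(record))
# {'id': 1, 'address.city': 'Seattle', 'address.geo.lat': 47.6}
```

Because the dotted paths encode where each value came from, a consumer can reconstruct the nesting rather than being left with anonymous columns.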
With Glue Elastic Views, you could stream real-time changes to product catalogs maintained in DynamoDB into Elasticsearch, which presents a more intuitive environment for customers to find product listings. While you could previously perform this task with the dedicated DynamoDB-Elasticsearch connector, the advantage of Elastic Views is that the process is far simpler and, as part of the service, changes are replicated automatically. With the original connector, this task would have required significant manual (and potentially error-prone) coding.
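A toy version of that DynamoDB-to-Elasticsearch flow looks like this: change records (shaped loosely like DynamoDB Streams events, which use the INSERT/MODIFY/REMOVE event names) are applied to an in-memory dict standing in for the search index. Everything else here is illustrative.

```python
# Sketch of CDC-style replication: fold a stream of change records
# into a target "index" (a plain dict standing in for Elasticsearch).

def apply_cdc(index, records):
    """Apply DynamoDB-Streams-style change records to a target index."""
    for rec in records:
        key = rec["key"]
        if rec["eventName"] in ("INSERT", "MODIFY"):
            index[key] = rec["newImage"]   # upsert the item's new image
        elif rec["eventName"] == "REMOVE":
            index.pop(key, None)           # drop deleted items
    return index

catalog_index = {}
apply_cdc(catalog_index, [
    {"eventName": "INSERT", "key": "sku-1", "newImage": {"name": "Lamp", "price": 20}},
    {"eventName": "MODIFY", "key": "sku-1", "newImage": {"name": "Lamp", "price": 18}},
    {"eventName": "INSERT", "key": "sku-2", "newImage": {"name": "Desk", "price": 90}},
    {"eventName": "REMOVE", "key": "sku-2"},
])
print(catalog_index)  # {'sku-1': {'name': 'Lamp', 'price': 18}}
```

With the old connector approach, this apply loop (plus retries, ordering, and schema handling) was your code to write and operate; with Elastic Views it is part of the service.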
For AWS, variety has been the spice of life, stretching from its hundreds of services and permutations of EC2 compute and storage infrastructure to the variety of analytic, machine learning, container development, and database services, among others. We’re tempted to use the metaphor that by running enough database services, AWS is hoping some of them would stick (together). But instead, we’ll leave it at this: AWS’s challenge is to build on the synergies that could bind its diverse services together. Glue Elastic Views is a good start.