In the year ahead, we see the cloud, AI, and data management as the megaforces of the data and analytics agenda. And so, picking up where Big on Data bro Andrew Brust left off last week, we’re looking at some of the underlying issues that are shaping adoption.
In the world of data and analytics, you can’t start a conversation today without bringing in cloud and AI. Yesterday in Part I, we hit the cloud checkbox: we explored how the upcoming generation change in enterprise applications will in turn shift the context of how enterprises are going to be evaluating cloud deployment. Today we turn our attention to the core building block – what’s happening in databases, and what we expect to become the sleeper issue this year in AI.
It’s now Data, not Big Data
But first some context. Until now, we framed our annual outlooks as being about Big Data because until recently, it was considered exceptional. The definition of Big Data was introduced by Doug Laney, today a principal with Caserta, back when he was with analyst firm Meta Group in 2001. Big Data was novel because processing it was beyond the existing data warehousing technologies and BI analytic tools of the day.
Today, Big Data is just Data because necessity has become the mother of invention. As we’ll note below, the database universe has expanded well beyond the core relational model to encompass a wide spectrum of data platforms and types. So, we’re now just calling it data, and changing the name of our annual outlook. Of course, we’re not the first to make that observation, as Gartner took Big Data off the hype cycle back in 2015.
Now let’s get back to our regularly scheduled program.
Getting AI out of the black box
Among the industry observations reported by Andrew last week was the perception that AI has become mainstream in analytics. In fact, analytics is the tip of the iceberg as consumers, machines, and organizations consume services that are powered by AI every day. But as consumption of the results of AI spreads across the services that power the economy, there has been growing concern over the ethics, biases, or other assumptions that can easily skew the algorithms and selection of data that powers AI.
Today, AI is hardly considered smart. While the data sets and models can be complex, the decisions lack human context. AI can make yes/no decisions, detect patterns, and provide predictive or prescriptive recommendations, but for the foreseeable future, unlike humans, AI won’t be able to learn something in one context and apply it to another. But even when making simple decisions, like whether to grant a loan or what to recommend, AI can still cause damage. Former Wall Street quant Cathy O’Neil brought awareness of potential AI bias with her 2016 book Weapons of Math Destruction.
The selection and handling of data is another concern. Get a large enough data set and you can always find at least some pattern. For instance, collect dietary habits on a large enough pool of licensed drivers and you might find some patterns relating to risk. But as correlation is not always causation, determining whether those patterns are relevant enough to change underwriting standards or are merely freaks of sampling still requires a human in the loop.
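To see how easily a pattern can surface by chance, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the "dietary habit" features and the "risk" score are pure random noise, and the function names are hypothetical. It simply scans many unrelated features and reports the one that happens to correlate best with risk.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_drivers=50, n_habits=200, seed=0):
    """Return (habit_index, r) for the noise feature that best 'predicts' risk."""
    rng = random.Random(seed)
    # Hypothetical risk score per driver: pure noise, related to nothing.
    risk = [rng.gauss(0, 1) for _ in range(n_drivers)]
    best_index, best_r = -1, 0.0
    for i in range(n_habits):
        # Hypothetical dietary-habit feature: also pure noise.
        habit = [rng.gauss(0, 1) for _ in range(n_drivers)]
        r = pearson(habit, risk)
        if abs(r) > abs(best_r):
            best_index, best_r = i, r
    return best_index, best_r


if __name__ == "__main__":
    index, r = strongest_chance_correlation()
    print(f"habit #{index} correlates with risk at r = {r:.2f}, by chance alone")
```

With enough candidate features, at least one will look meaningful even though none is, which is why the winning "pattern" still needs a human to judge whether it makes sense.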
As AI becomes more mainstream, businesses will increasingly be held accountable for the decisions that are made with the help of AI algorithms, regardless of how powerful or limited those algorithms are. Over the past year we’ve seen the emergence of early stabs at making AI “explainable” from IBM, Google, H2O.ai and others.
As you’d expect, given that these are still early days when it comes to AI explainability and bias detection, the capabilities are still fairly rudimentary: they typically operate at the individual feature or attribute level, akin to seeing the trees but not the forest. Check out disclosure pages such as this or videos that paint a realistic picture of what’s possible today.
For instance, today’s capabilities can identify statistically which feature(s) of a model most influenced the outcome (e.g., generating a decision or prediction, or recognizing an image or text). For extremely simple models, such as those in the last step of a food chain for making decisions in regulated sectors such as finance or healthcare, they can generate “reason codes.” They can also identify which attributes or features should be tracked for potential bias (which is akin to data security tools for identifying PII). And based on those findings, the tools of today can conduct “disparate impact analysis,” which is a fancy term for identifying whether the model was biased against a particular segment of people. In some cases, the capabilities for interpreting or explaining models are limited to a single framework such as TensorFlow. As for anything more ambitious, today there are only best guesses for extrapolating more holistic explanations of why models make decisions.
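As a rough illustration of what a disparate impact check boils down to, here is a minimal Python sketch. It is not taken from any of the vendor tools mentioned above: the function names, the made-up loan decisions, and the 0.8 cutoff (the commonly cited "four-fifths" rule of thumb) are all assumptions for the example. It compares the rate of favorable outcomes for one group against a reference group.

```python
from collections import Counter


def approval_rates(decisions):
    """Map each group to its share of favorable outcomes.

    `decisions` is an iterable of (group, approved) pairs.
    """
    totals, approved = Counter(), Counter()
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += 1 if ok else 0
    return {group: approved[group] / totals[group] for group in totals}


def disparate_impact_ratio(decisions, group, reference):
    """Favorable-outcome rate of `group` divided by that of `reference`."""
    rates = approval_rates(decisions)
    if rates[reference] == 0:
        raise ValueError("reference group has no favorable outcomes")
    return rates[group] / rates[reference]


# Made-up loan decisions: group "A" approved 50 of 100, group "B" 30 of 100.
sample = (
    [("A", True)] * 50 + [("A", False)] * 50
    + [("B", True)] * 30 + [("B", False)] * 70
)

if __name__ == "__main__":
    ratio = disparate_impact_ratio(sample, group="B", reference="A")
    # 0.8 is an assumed screening threshold, not a rule from the article.
    print(f"ratio = {ratio:.2f}, flag for review: {ratio < 0.8}")
```

A ratio well below 1.0 does not prove the model is biased, but it tells a reviewer which segment deserves a closer look.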
Our take is that model explainability or interpretability is ripe for development. Look for announcements here. Behind all the noise of AI-related product announcements this year, we expect that data science collaboration tools and cloud-based AI and AutoML services will up their game on explainability. Today, most of these services can document changes to models over time, and they will likely utilize model lineage data as the starting point for building out their capabilities to articulate why models make decisions. Initially, these capabilities are likely to present their findings through statistical visualizations, requiring a data scientist to translate. Later on, they will likely add more natural language capabilities that are aimed at business people.
AI explainability won’t only be about technology, but it will involve best practices as well. One of the interesting lessons we picked up from listening to H2O.ai’s Patrick Hall is, if you want to make your model explainable, don’t make it too complex. Data scientists could learn a thing or two from app developers.
Nonetheless, by year end we’ll still be a long way from being able to get holistic explanations that go beyond individual details or attributes. AI explainability is going to be a work-in-progress for some time to come.
Clash of the titans: Specialized vs. Multi-model databases
After the conclusion of Y2K, the relational database became the enterprise de facto standard, but as data volumes and types exploded, so did a whole new breed of platforms from key-value to document, graph, column stores, blockchain, and more. It’s gotten to the point where Amazon’s portfolio now lists 15 distinct database platforms.
And that’s opened a debate among platform providers that should sound familiar: the age-old debate of single umbrella platform vs best-of-breed has now spread from the application to the database space. On one side, Amazon promotes the strategy of choosing the right database for the job; on the other are players like Oracle, Microsoft, and even SAP that have promoted the Swiss Army Knife approach. Traditionally, database platforms such as Oracle or SQL Server have approached multi-model capability by extending their SQL querying capabilities or adding new ones, such as in-database R or Python support.
With the new generation of born-in-the-cloud databases, many are storing data in a canonical format and then exposing it via APIs. Microsoft Azure Cosmos DB is the poster child for this approach, but peer beneath the surface and you’ll find that some of the specialized cloud-native database platforms from other providers are also using APIs prominently in their architectures.
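To make the idea concrete, here is a toy Python sketch of one canonical record store exposed through several access styles. It is an assumption-laden illustration, not how Azure Cosmos DB or any other product is actually built: the class name, method names, and the convention of treating a "links" field as graph edges are all hypothetical.

```python
class CanonicalStore:
    """Toy store: every record is kept once, as a plain dict keyed by id."""

    def __init__(self):
        self._records = {}

    def put(self, record_id, record):
        self._records[record_id] = dict(record)

    # Key-value style access: fetch one record by its key.
    def get(self, record_id):
        return self._records.get(record_id)

    # Document style access: return ids of records matching field values.
    def find(self, **criteria):
        return [
            rid for rid in sorted(self._records)
            if all(self._records[rid].get(k) == v for k, v in criteria.items())
        ]

    # Graph style access: treat a list-valued field as outgoing edges.
    def neighbors(self, record_id, edge_field="links"):
        record = self._records.get(record_id, {})
        return [rid for rid in record.get(edge_field, []) if rid in self._records]


store = CanonicalStore()
store.put("c1", {"type": "customer", "city": "Boston", "links": ["o1", "o2"]})
store.put("o1", {"type": "order", "total": 40})
store.put("o2", {"type": "order", "total": 15})

if __name__ == "__main__":
    print(store.get("o1"))            # key-value lookup
    print(store.find(type="order"))   # document-style filter
    print(store.neighbors("c1"))      # graph-style traversal
```

The point of the design is that the data is stored once, while each API is just a different lens on the same records.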
In a previous life as an Ovum analyst, we forecast back in 2014 that the coming age of database diversity would also lead to database overlap (see diagram). Specialized databases would continue to thrive, but they would add capabilities that overlapped with other forms of data, such as relational databases querying JSON documents, or document-oriented databases offering SQL-like query languages. This is useful for empowering the large base of SQL developers and giving them additional querying capabilities. However, the fact that, for instance, Oracle or IBM Db2 could query JSON was not meant to replace the need for MongoDB; instead, we viewed them as being for edge cases where the line organization working with a customer transaction database also wanted the ability to query non-relational data on customer profiles.
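For a feel of what that edge case looks like, here is a small Python sketch using the standard library's sqlite3 module as a stand-in. It does not reproduce Oracle's or Db2's JSON syntax; instead it registers a hypothetical helper, json_field, as a user-defined SQL function so that an ordinary SQL query can filter on a field inside a JSON profile stored next to relational columns. The table, names, and data are made up.

```python
import json
import sqlite3


def json_field(document, key):
    """Return one top-level field from a JSON document stored as text."""
    return json.loads(document).get(key)


conn = sqlite3.connect(":memory:")
# Register the helper so SQL statements can call it (hypothetical name).
conn.create_function("json_field", 2, json_field)

conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, profile TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name, profile) VALUES (?, ?, ?)",
    [
        (1, "Ada", json.dumps({"tier": "gold", "channel": "mobile"})),
        (2, "Grace", json.dumps({"tier": "silver", "channel": "web"})),
        (3, "Edsger", json.dumps({"tier": "gold", "channel": "web"})),
    ],
)

# An ordinary SQL query that filters on a field inside the JSON profile.
gold_customers = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers "
        "WHERE json_field(profile, 'tier') = 'gold' ORDER BY id"
    )
]

if __name__ == "__main__":
    print(gold_customers)
```

This is convenient for a SQL developer who occasionally needs to reach into a document, but it is a long way from the indexing and query features of a dedicated document database.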
By the way, in that same research, we posed the question of who would “own” the query. Enter the current era of data catalogs.
As we noted in Part I of our 2020 outlook, our view that enterprises will increasingly look at cloud-native as their default deployment choice will simply escalate this almost age-old debate. Our take is that there is no silver bullet and no binary answer.
Don’t get us wrong, fit-for-purpose databases are here to stay. If the use case is heavily centered on a single data type, a database that is promoted as multi-model will be overkill. There is also the question of highly sophisticated capabilities, such as writing extremely complex SQL statements requiring multiple table joins or graph queries that traverse more than three hops. For those, it’s best to stick with best-in-class.
But we also expect that edge cases that require a mix of data access approaches will become far more commonplace. Pair an asset management transaction system with IoT data for planning maintenance, or a supply chain planning system with mobile and IoT data, and you’ve got a ready case for extensibility.
And that’s where we’d like to see the cloud-native database providers step up to the plate. As some of their platforms already use APIs to expose data, they should exploit the potential of providing multiple pathways to the data, pairing SQL, JSON, graph, and/or search, for instance. It’s not just a matter of extending SQL. We expect to hear more about cross-cutting capabilities from each of the major cloud database providers this year.
Our Data Outlook for 2020 is in two parts. For Part I, covering the Hybrid Default cloud, click here.