One of the truisms in the technology world that has never gone out of style is the notion of “garbage in, garbage out.” And as enterprises enter what Informatica terms the era of “Data 3.0,” the company believes that AI – in most cases here, machine learning – is the only way that enterprises will be able to keep the data quality issue under control.
Paraphrasing Informatica, “Data needs AI, and AI needs data.” Hold that thought.
The data estate has certainly grown, as enterprises now routinely go beyond their walled gardens of transaction data to digest IoT data, social media, messaging, log files, and other sources to support use cases ranging from customer engagement to cybersecurity, supply chain management, and operational efficiency.
We first stumbled across Informatica at a database expo back in the mid-1990s. The company demoed its visual ETL tool PowerMart in a hotel suite because, as a startup, it was too small to pay for a booth on the expo floor. Informatica changed the conversation because until then, ETL was largely about manual scripting. It added a metadata engine for storing data transformation recipes, fronted by a GUI that required knowledge of schema rather than coding skills. Since then, we’ve come full circle. Among the announcements that came out of this year’s Informatica World back in May was a new feature that can parse Python scripts for data lineage. At least for data scientists, coding data transformations remains very much in style.
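Parsing a script for lineage is tractable because Python exposes its own syntax tree. As a rough illustration of the idea (our own toy sketch, not Informatica's implementation), a scanner can walk a script's AST and record which file paths common pandas I/O calls read and write; the specific call names tracked here are our assumption:

```python
import ast

# Assumed pandas-style I/O method names; a real lineage tool would
# cover far more libraries and call patterns.
SOURCE_CALLS = {"read_csv", "read_parquet", "read_sql"}
SINK_CALLS = {"to_csv", "to_parquet", "to_sql"}

def extract_lineage(script: str) -> dict:
    """Scan a Python script and record which literal file paths are
    read (sources) and written (sinks) by common pandas I/O calls."""
    tree = ast.parse(script)
    lineage = {"sources": [], "sinks": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            name = node.func.attr
            # Only capture literal string arguments; a production tool
            # must also resolve variables, f-strings, and config lookups.
            paths = [a.value for a in node.args
                     if isinstance(a, ast.Constant) and isinstance(a.value, str)]
            if name in SOURCE_CALLS:
                lineage["sources"] += paths
            elif name in SINK_CALLS:
                lineage["sinks"] += paths
    return lineage
```

Run against a script that reads one file and writes another, the scanner yields the source and sink paths, which is the raw material a lineage graph is built from.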
Over the years, Informatica has expanded its footprint from its ETL roots to data quality, data profiling, metadata management, and master data management, among others. It’s done so in a market that, as Forrester analyst Michelle Goetz pointed out in these pages, has never really coalesced. In the interim, lots of point tools – some of them open source – have emerged, but aside from IBM, no player at Informatica’s scale has emerged to challenge it.
In a recent blog post, Amalgam Insights analyst Lynne Baer (no relation, at least that we know of) delivered a good overview of how Informatica’s underlying CLAIRE machine learning engine is driving Informatica’s product strategy. CLAIRE itself is not a product or tool, but an umbrella for the machine learning capabilities that are sprinkled through Informatica’s suite. Examples include automating the parsing, transforming, and joining of variably structured data; tagging data so it can be identified for classification, governance, and privacy protection; flagging potential data quality issues or places to de-duplicate; scanning a data set to generate data quality rules; scoring data so it can be labeled in a business glossary; and providing machine assistance in data discovery.
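To make one of those capabilities concrete, flagging places to de-duplicate boils down to scoring how similar record pairs are and surfacing the borderline ones for review. The sketch below uses simple string similarity as a stand-in for the probabilistic, multi-feature matching a trained ML engine would actually do; the function name and threshold are our own illustration:

```python
from difflib import SequenceMatcher
from itertools import combinations

def flag_possible_duplicates(records, threshold=0.6):
    """Flag record pairs whose text is similar enough to warrant a
    de-duplication review. A toy stand-in for ML-driven matching,
    which would weigh many fields, not one string."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        # Case-insensitive similarity ratio between 0.0 and 1.0
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            flagged.append((i, j, round(score, 2)))
    return flagged
```

The point of the ML assist is precisely that humans don't have to eyeball every pair: the machine proposes candidates, and stewards confirm or reject them.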
In a briefing a few weeks back, Informatica made a case for why data quality and data integration have outgrown what humans can handle on their own. It starts with the torrents of data and the nature of that data. When you are tapping the social media or IoT firehose, you are often ingesting terabytes of data at a time. And with multi-structured data, the schema is far more complex.
This is not a case of ordinary CSV customer or product order files where, even if the schema is not consistent, it may be fairly straightforward to identify what’s a name or what’s a numerical field such as an order number, SKU, part number, phone number, tax ID, or Social Security number. For those cases, data prep tools emerged that used a modest level of machine learning to conduct pattern matching, identifying columns and determining how the columns of different data sets should be transposed or merged. Instead, the challenge is munging files where the data structure is far more cryptic and variable, to the point where humans may not be able to parse it without some machine assist.
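The "modest level of machine learning" in those data prep tools often starts with something as humble as pattern matching over a column's values. A minimal sketch of the idea, assuming our own simplified patterns and an arbitrary 80% match threshold:

```python
import re

# Ordered most-specific-first; a simplified stand-in for the pattern
# libraries data prep tools use to guess what a column contains.
PATTERNS = [
    ("ssn",        re.compile(r"^\d{3}-\d{2}-\d{4}$")),
    ("phone",      re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$")),
    ("numeric_id", re.compile(r"^\d+$")),
]

def classify_column(values):
    """Guess a column's semantic type from the pattern that most of
    its values match, tolerating a few dirty entries."""
    for label, pattern in PATTERNS:
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= 0.8:
            return label
    return "text"
```

Note that classifying a column as an SSN or phone number is exactly the kind of tagging that feeds downstream governance and privacy controls.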
So how is Informatica sprinkling AI into its portfolio? From its recent spate of Informatica World announcements, it has added smarter matching recommendations for mapping fields from source to target in its managed cloud service; capabilities for detecting “schema drift” in big data; automated creation of “contextual” Customer 360 views that infer relationships and recommendations based on customer behavior and preference history; an AI assist for generating data governance rules; and machine learning in its data catalog offering that helps users with data discovery and annotation.
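Of those, "schema drift" is worth unpacking: fields appear, disappear, or change type between batches from the same source. A bare-bones sketch of drift detection, under our own assumption that each schema is represented as a field-to-type mapping (how Informatica represents schemas internally is not disclosed here):

```python
def detect_schema_drift(expected: dict, incoming: dict) -> dict:
    """Compare an expected schema (field -> type) with the schema
    inferred from an incoming batch, reporting the three common
    flavors of drift: added, dropped, and retyped fields."""
    return {
        "added":   sorted(set(incoming) - set(expected)),
        "dropped": sorted(set(expected) - set(incoming)),
        "retyped": sorted(f for f in set(expected) & set(incoming)
                          if expected[f] != incoming[f]),
    }
```

Run continuously against each arriving batch, a check like this turns silent pipeline breakage into an alert a data engineer can act on.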
Informatica has made its case that with more data, and more varied data, than ever, AI has become essential for enterprises to avoid getting buried in garbage data. The other end of the equation is that, as enterprises start engaging in their own AI or machine learning projects, there are too many places where these projects can go off the rails. It begins with the fact that AI models are ravenous for data. While we’ve seen isolated cases where AI models might not require vast torrents of data, in most cases, models feed on data to learn. The impact of feeding the wrong data to a machine learning or deep learning model, or of training and running the model on an inadequately sized cohort of data, is arguably far greater than the impact of working with compromised data in a static data science model.
And so that brings us full circle. AI models require lots of data. And according to Informatica, when you have lots of data, you’ll need AI to separate the signal from the noise.