Extraction, transformation and load (ETL) became a familiar concept in the 1990s, when data warehousing became a well-known business intelligence (BI) concept. The advent of the web, and the vast volume of data that came with it, took many organizations’ focus away from ETL and toward data lakes. Too many people disparaged ETL as a tool of the past. However, as IT has always been aware, data lakes aren’t a solution unto themselves, and rebranding to ELT doesn’t change the fact that there are now far more sources and targets than there ever were. Data movement is still a complex problem, and so is metadata management (MDM). It’s a problem becoming even more challenging as regulatory requirements for privacy mean data must be better tracked and controlled.
Metadata is, simply, the data that describes the data. The metadata tells systems, and then people, whether a field is characters, a number, a currency amount, and more. At a higher level, metadata gives a name to the data. However, different systems use different names for much of the same data. For instance, who is to know whether “Payroll tax,” “state payroll tax,” “tax-11,” and “prt2,” each in a different system, refer to the same number?
One of the largest challenges in data warehousing is working to mesh the metadata from multiple systems to recognize logical objects such as “payroll tax”. That has only become more challenging with the expansion of systems now in the cloud era. Significant blocks of time are lost in analyzing metadata and synchronizing information being moved forward toward analytical systems.
At the same time, the reverse flow has to be supported. It’s one thing to present a visualization about sales in a multinational company. It’s quite another for the sales VP to want to drill back to the source data when something intriguing is noticed. If the only thing the BI system understands is the metadata labels for the rollups, how can it drill down to details from a system of origin? What is the provenance of the information?
The problem is not only becoming more challenging, it’s also becoming more critical. Regulations such as the EU’s GDPR and California’s upcoming CCPA are requiring more privacy control over consumer data. Identifying private information is the first step towards compliance.
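As a rough illustration of what that first step might look like, the Python sketch below flags fields that may hold private information using two simple signals: hints in the field label and the share of sample values that look like email addresses. The hint list, the pattern, the threshold and the sample field names are all assumptions made for this example. Personal data under GDPR or CCPA is far broader than anything a check like this can detect.

```python
import re

# Hypothetical label fragments that often indicate personal data.
LABEL_HINTS = ("email", "phone", "birth", "ssn", "passport")

# Deliberately loose email pattern, for illustration only.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def flag_private_fields(samples, min_share=0.8):
    """Return {field: reason} for fields that may hold private data.

    `samples` maps a field label to a list of sample values.
    """
    flagged = {}
    for field, values in samples.items():
        # Signal 1: the label itself contains a known hint.
        hint = next((h for h in LABEL_HINTS if h in field.lower()), None)
        if hint:
            flagged[field] = f"label hint: {hint}"
            continue
        # Signal 2: most sample values match the email pattern.
        texts = [str(v).strip() for v in values if v is not None]
        if texts:
            share = sum(1 for t in texts if EMAIL.match(t)) / len(texts)
            if share >= min_share:
                flagged[field] = "value looks like email"
    return flagged


# Made-up sample data from a hypothetical source system.
samples = {
    "cust_contact": ["ana@example.com", "li@example.org", "sam@example.net"],
    "order_total": ["19.99", "5.00", "120.50"],
    "date_of_birth": ["1980-01-02", "1975-07-30"],
}
flagged = flag_private_fields(samples)
```

Here "cust_contact" is caught by its values even though its label gives nothing away, which is why inspecting the data as well as the metadata matters.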
Let us throw another complication into the process. Let’s go back to the multinational. It’s not just the different names in the different systems, it’s different languages. One system has original metadata in English, another in French, and another in German. Translation doesn’t necessarily help.
People can’t work on large, complex, metadata integration sets in a rapid manner. What’s needed is an algorithmic way of doing so. Statistical processes can be used both on the metadata and the data in order to resolve issues and make rapid recommendations on the relationship between different metadata labels.
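To make the idea concrete, here is a minimal Python sketch of how a statistical comparison of both labels and data might produce recommendations. It combines a string similarity score on the metadata labels with a crude comparison of each column's mean and spread. The weights, the threshold, the system names and the sample values are assumptions made for illustration. This is not any vendor's method, and a production system would use far richer features.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, pstdev


def name_similarity(a, b):
    """Similarity of two metadata labels, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closeness(p, q):
    """Relative closeness of two numbers, clamped to 0.0..1.0."""
    denom = max(abs(p), abs(q))
    if denom == 0:
        return 1.0
    return max(0.0, 1.0 - abs(p - q) / denom)


def data_similarity(xs, ys):
    """Compare two numeric columns by their mean and standard deviation."""
    return (closeness(mean(xs), mean(ys))
            + closeness(pstdev(xs), pstdev(ys))) / 2


def recommend(columns, threshold=0.6, name_weight=0.3):
    """Suggest label pairs from different systems that may be the same data.

    `columns` maps (system, label) to a list of numeric sample values.
    """
    suggestions = []
    for (key_a, xs), (key_b, ys) in combinations(columns.items(), 2):
        if key_a[0] == key_b[0]:
            continue  # only compare labels across systems
        score = (name_weight * name_similarity(key_a[1], key_b[1])
                 + (1 - name_weight) * data_similarity(xs, ys))
        if score >= threshold:
            suggestions.append((key_a, key_b, round(score, 3)))
    return suggestions


# Made-up sample columns from two hypothetical systems.
columns = {
    ("HR", "Payroll tax"): [410.0, 395.5, 420.25, 402.0],
    ("Finance", "prt2"): [408.0, 399.0, 418.5, 404.75],
    ("Finance", "headcount"): [12, 15, 11, 14],
}
recs = recommend(columns)
```

With this sample data, the only pair suggested is "Payroll tax" and "prt2": the labels share little, but the values behave alike. Any such suggestion is a recommendation for a person to confirm, not a final answer.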
Metadata Management And Machine Learning
This is where machine learning (ML) can come in. Analyzing a complex enterprise environment can go much faster with automation. What’s interesting about this problem is the solution can sit on the more business intelligence side of the ML equation. A couple of years ago, I wrote about my acceptance of the change to the definition of ML. Computing power has allowed advanced statistical modeling to provide improved insight, so that ML now sits in between AI and BI.
Octopai is one of the latest companies to attack the metadata management challenge that flows throughout the enterprise information architecture. When I spoke with Amnon Drori, CEO & Co-Founder, we discussed how data and information can’t accurately move from sources to BI systems without strong metadata linking it at all levels. “Data is already moving through multiple ETL processes in any large company,” said Mr. Drori. “It’s critical to not only look at the data, but to analyze the process to find similarities that help clarify metadata reuse. Being compliant first means understanding all your data, and that means identifying metadata and creating accessible metadata catalogs.”
By using modern ML processes on both data and existing flows, companies can better identify and then manage that data. The result is not just an improvement in analytics.
Modern compliance, with both governmental regulations and contracts, means that companies with strong metadata management can provide provenance of information chains and verify that private information is being kept private. ETL, regardless of how many modern systems try to hide it, still matters, and machine learning is a key tool that can help manage the metadata required to keep information accurate, controlled, and flowing.