According to a recent KPMG report, Artificial Intelligence has moved past the experimentation phase at a number of companies.
The report, cited in the WSJ and based on interviews with executives at 30 Global 500 companies, found that 30% are using artificial intelligence or machine learning in selected functions, while 17% said they have deployed AI and ML at scale across the enterprise.
As companies deploy more and more AI solutions, one field is gaining traction to support these deployments: data management. Before any project can be launched, cleaning, organizing, and preparing data are key steps.
It’s often one of the longest phases of a data science project. Many data scientists complain about the time they spend cleaning, organizing, and managing data before it is fed to a model, leaving very few hours for model tuning, results analysis, and experimentation.
According to Gartner, “Data preparation is an iterative and agile process for exploring, combining, cleaning, and transforming raw data into curated datasets for self-service data integration, data science, data discovery, and BI/analytics.
To perform data preparation, data preparation tools are used by analysts, citizen data scientists and data scientists for self-service.
The tools are also used by citizen integrators and data engineers for data enablement to reduce the time and complexity of interactively accessing, cataloging, harmonizing, transforming, and modeling data for analytics in an agile manner with metadata and lineage support.”
Gartner defines the capabilities of the field as follows:
- Data ingestion and profiling — a visual environment that enables users to interactively ingest, search, sample and prepare data, and also match, tag and annotate data for future exploration.
- Data cataloging and basic metadata management — Creating and searching metadata; cataloging of data sources, transformations, data source attributes, data lineage and relationships, and APIs.
- Data modeling and transformation — Supports data mashup and blending; data cleansing; filtering; and user-defined calculations, groups, and hierarchies.
- Data security — Inclusion of security features such as data masking, platform authentication and security filtering at the user/group/role level.
- Basic data quality and governance support — Integration with tools supporting data governance/stewardship and capabilities for data quality, user permissions and data lineage.
- Data enrichment — Support for basic data enrichment capabilities including entity extraction, the capturing of attributes from the integrated data using the data preparation tool.
- User collaboration and operationalization — Facilitates the sharing of queries and datasets, including publishing, sharing and promoting models with governance features such as dataset user ratings or official watermarking.
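To make the core of these capabilities concrete, here is a minimal, hand-rolled sketch of the basic data-preparation loop (ingest, profile, clean, de-duplicate) using pandas. The column names, values, and cleaning rules are hypothetical, invented for illustration; commercial tools automate and scale these same steps.

```python
import io
import pandas as pd

# Hypothetical raw export: messy headers, inconsistent casing, a duplicate
# row, and an unparseable numeric value.
RAW_CSV = """customer, region ,revenue,signup_date
Acme Corp ,EMEA,1200,2019-03-01
acme corp,emea,1200,2019-03-01
Globex,APAC,not available,2019-04-15
"""

def prepare(raw: str) -> pd.DataFrame:
    df = pd.read_csv(io.StringIO(raw))
    # Ingestion/profiling: normalize headers before touching values.
    df.columns = [c.strip().lower() for c in df.columns]
    # Cleaning: trim and lowercase text fields, coerce bad numbers to NaN,
    # parse dates into a proper datetime type.
    for col in ("customer", "region"):
        df[col] = df[col].str.strip().str.lower()
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    # De-duplication: the two "Acme Corp" rows collapse into one.
    return df.drop_duplicates().reset_index(drop=True)

curated = prepare(RAW_CSV)
print(len(curated))                     # 2 rows survive
print(curated["revenue"].isna().sum())  # 1 unparseable revenue value
```

Real data-preparation platforms add the surrounding capabilities (cataloging, lineage, governance, collaboration) on top of this kind of transformation pipeline.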
Other differentiating capabilities:
- Data source access/connectivity — APIs and standards-based connectivity, including native access to cloud application and data sources, such as popular database PaaS (dbPaaS) and cloud data warehouses
- Machine learning (for insights and automation) — Use of machine learning AI to improve and, in some cases, even automate the data preparation process. Inclusion of algorithms to enable users to identify data structures, schemas and relationships, and the ability to structure the datasets upon initial data ingestion.
- Hybrid and multi-cloud deployment options — Organizations need data preparation tools that can be deployed either in the cloud (through platform as a service [PaaS]), on-premises, or across both cloud and on-premises in a hybrid integration platform setting.
- Domain- or vertical-specific offerings or templates — Packaged templates or offerings for domain- or vertical-specific data and models.
Flatfile, a year-old, Denver, Co.-based startup that says its API turns user data into product data, has raised $2 million in funding led by Afore Capital. VentureBeat has more here.
Trifacta, a 7.5-year-old, San Francisco-based company that specializes in cleaning corporate data so it can be analyzed, has raised $100 million from Telstra Ventures, Energy Impact Partners, NTT DOCOMO Ventures, BMW iVentures, and ABN AMRO. Fortune has more here.
Explorium, a two-year-old, Israel-based data discovery platform for machine learning models, has quietly raised two rounds of funding totaling $19.1 million from investors that include Emerge, F2 Capital, and Zeev Ventures, among others. TechCrunch has more here.
The first one is Flatfile.io, a Denver, Colorado-based startup founded two years ago. The company grew out of the founder’s frustration when trying to import data into his previous company’s system. In February the company had a dozen paying customers (and more than a hundred free ones) and was building the next iteration of Flatfile based on what it learned from 15,000 uploaded files (and 50 million uploaded records).
What Flatfile solves:
- JSON configurator allows CSV fields to be mapped to specific data models. Subsequent CSV data imports from users will adhere to this data model.
- Automatic matching of 95% of imported columns, using a combination of machine learning and fuzzy matching.
- Users can upload data via CSV, XLS, or simply paste from the clipboard.
- Analyze past data uploads to help resolve import issues without any guesswork.
- Data processing happens entirely in the browser, ensuring that no sensitive data is ever sent to Flatfile’s servers.
- Flag data that doesn’t seem to be formatted correctly — or perhaps a typo or a new type of data format that’s germane to a certain industry — and then ask the user about it.
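Flatfile’s actual column-matching pipeline is proprietary and combines machine learning with fuzzy matching; as a rough illustration of the fuzzy-matching half alone, here is a toy sketch using Python’s standard-library difflib. The target schema and incoming headers are invented for the example.

```python
import difflib

# Hypothetical target data model that imported CSVs must map onto.
TARGET_FIELDS = ["first_name", "last_name", "email_address", "phone_number"]

def match_columns(csv_headers, targets=TARGET_FIELDS, cutoff=0.6):
    """Map each incoming CSV header to the closest target field, if any."""
    mapping = {}
    for header in csv_headers:
        # Normalize before comparing: casing and separators vary in the wild.
        norm = header.strip().lower().replace(" ", "_")
        candidates = difflib.get_close_matches(norm, targets, n=1, cutoff=cutoff)
        mapping[header] = candidates[0] if candidates else None
    return mapping

mapping = match_columns(["First Name", "LastName", "e-mail address", "fax"])
# "First Name", "LastName", and "e-mail address" map to target fields;
# "fax" has no close match and is left for the user to resolve.
```

Unmatched columns (`None` here) are exactly the cases where a product like Flatfile would fall back to asking the user, as described in the last bullet above.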
Explorium is an automated data and feature discovery platform. It is not a direct competitor, as it does not specialize in the data preparation, data cleaning, and data wrangling space.
Explorium is more of a data enrichment platform: it analyzes data with machine learning and looks for complementary data sources to enrich the customer’s data.
“We are doing for machine learning data what search engines did for the web,” Explorium CEO Maor Shlomo told VentureBeat.
The platform provides contextual understanding: it recognizes the meaning behind datasets in order to make connections to the best enrichment sources possible.
“We developed a new type of search engine that’s capable of looking at the customer’s data, connecting and enriching it with literally thousands of data sources, while automatically selecting the best pieces of data and the best variables or features, which could actually generate the best-performing machine learning model,” he explained to TechCrunch.
Trifacta is the largest of the three, having raised $100M, and it has been recognized by Gartner as the #1 solution for data preparation. The company is not a unicorn yet, according to VentureBeat. Trifacta CEO Adam Wilson believes his tool will keep data scientists from being glorified “data janitors.”
The company’s main mantra is the old adage “garbage in, garbage out.” “People have woken up to the fact that if your data quality is bad, your A.I. and machine learning is going to be worthless,” Wilson tells Fortune. “The last thing they want to do is to automate bad decisions faster based on bad data.”
The company’s competition — which includes those who sell data-cleaning tools or have incorporated data-cleaning features into their products — includes startups like Tamr and Paxata, Alteryx (which went public in 2017), and larger players such as Microsoft and Tableau.
Let’s have a look at some of the players in the data preparation/data wrangling space.
Tamr’s software takes in huge amounts of “unclean” data and uses machine-learning technology to clean it up and make it more usable. The company has worked with GE. “Tamr was able to take hundreds of thousands of GE supplier records and identify where multiple records were actually from the same supplier,” Emily Galt, vice president of technical product management for GE Digital Thread (the name for the company’s overall effort to modernize its acquisition processes), told Fortune. She said the use of Tamr in this project across a few GE divisions helped save $80 million over the past few years.
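Tamr uses machine learning for this kind of entity resolution; as a much cruder, rule-based approximation of the same idea, here is a sketch that collapses supplier records by normalizing names and stripping legal suffixes. The records, suffix list, and normalization rules are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical supplier records collected from different divisions.
RECORDS = [
    {"id": 1, "supplier": "Acme Industrial Supply, Inc."},
    {"id": 2, "supplier": "ACME INDUSTRIAL SUPPLY INC"},
    {"id": 3, "supplier": "Acme Industrial Supply"},
    {"id": 4, "supplier": "Baltic Fasteners Ltd."},
]

# Tokens that vary between records without changing the underlying entity.
LEGAL_SUFFIXES = {"inc", "incorporated", "ltd", "llc", "corp", "co"}

def canonical_key(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def group_suppliers(records):
    """Group record ids that resolve to the same canonical supplier."""
    groups = defaultdict(list)
    for rec in records:
        groups[canonical_key(rec["supplier"])].append(rec["id"])
    return dict(groups)

groups = group_suppliers(RECORDS)
# Records 1-3 collapse into one supplier entity; record 4 stands alone.
```

A production system like Tamr replaces these hand-written rules with learned matching models that generalize across messy variations no fixed rule list can cover.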
Tamr competes with Informatica and Ascential (now part of IBM), as well as newer entrants like Trifacta, Paxata, and ClearStory.
According to Solutions Review, Paxata’s Adaptive Information Platform offers data integration, quality, and governance capabilities for business analysts. The platform offers flexible deployment options, self-service operation, and a visual user interface with spreadsheet features, so users become familiar with the tool easily and rapidly. Its Assisted Intelligence provides algorithmic help to infer the meaning of data, and machine learning captures steps for future data work.
ClearStory Data is an enterprise-scale, continuous intelligence analytics solution for complex and unstructured data. It was acquired by Alteryx in April 2019.
This story was written by Melvine Manchau; views are my own.
The story is part of a series on AI including: