If this were a movie about AI, the main character of our story would be a data scientist – let’s call her Ria. Ria works at a multinational company, and one Monday morning she receives a request for help on a project to build an AI model. The project is highly visible and could yield significant savings for the company if Ria and her team can build an AI model that solves the problem. Excited, Ria immediately asks for data access so that she and her team can get started. They analyze the data to find data quality issues, clean the data, build features, and train a model. After several months, however, they are still struggling to build a high-accuracy model. With every iteration, they discover more data quality issues, go back to the drawing board to brainstorm, figure out ways to fix them, and write the code for data remediation. After weeks and months of effort, Ria believes the whole project would have been far more streamlined if she had received a report on the data quality at the very beginning, when the data was first handed over. Does this sound familiar?
Many studies have shown that data preparation is one of the most time-consuming parts of the machine learning lifecycle. One reason is that data issues are discovered in a trial-and-error fashion: new code must be written for every issue found, and someone must keep a manual log of all the changes applied to the data so that there is a lineage of how the data evolved over the course of building a machine learning pipeline. Unless explicitly recorded, this information is simply not available.
While data scientists solve these problems today by writing custom scripts or performing manual analyses, this is a time-consuming process, and some challenges, such as finding class overlap or label noise, can themselves require AI-based algorithms that might take several months to develop before they can be used in a business project. Moreover, data comes in different modalities, such as tabular data, time series data, and so on. Therefore, there is a need for automation in this space: a way to consistently assess data across different modalities, explain the assessment, suggest remediations, and generate the code to apply them.
To overcome these challenges, IBM Research has developed a Data Quality for AI Toolkit that is built using novel algorithms and provides a systematic way to assess and remediate data with well-specified APIs. The toolkit is built to serve a wide variety of use cases such as:
- Building supervised classification models
- Providing data quality for application workflows with intuitive mechanisms to take domain inputs
- Working under strict privacy constraints through data synthesis
- Automatically reporting on the data quality and capturing the lineage for the data
The toolkit has the following features:
- Validators: Algorithms that perform data quality assessment and output a data quality score between 0 and 1.
- Remediators: Algorithms that provide corrective actions to fix data quality issues and improve the data quality score.
- Constraints: Inputs provided explicitly by domain experts or derived implicitly by analyzing the data's characteristics.
- Data Synthesizer: When data cannot be shared due to strict privacy constraints, provides the capability to synthesize data by learning constraints from real data so that the synthetic data mimics the real data.
- Pipeline: Combines validators and remediators with constraints to address a use case or application workflow and outputs an overall data quality score.
- Data Readiness Report: Automated documentation of changes that records delta changes in quality metrics and data transformations applied.
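To make the pattern behind these features concrete, here is a minimal sketch of how a validator, a remediator, and a pipeline that reports quality-score deltas might fit together. All class and method names below are hypothetical illustrations of the concepts described above, not the toolkit's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical sketch of the validator/remediator/pipeline pattern.
# These names do NOT come from the Data Quality for AI Toolkit itself.

@dataclass
class Report:
    """Records delta changes in quality metrics (a tiny readiness report)."""
    entries: List[str] = field(default_factory=list)

    def log(self, message: str) -> None:
        self.entries.append(message)

class MissingValueValidator:
    """Scores quality as the fraction of non-missing cells, from 0 to 1."""
    def score(self, rows: List[dict]) -> float:
        cells = [v for row in rows for v in row.values()]
        if not cells:
            return 1.0
        return sum(v is not None for v in cells) / len(cells)

class DropMissingRemediator:
    """Corrective action: drop any row that contains a missing value."""
    def apply(self, rows: List[dict]) -> List[dict]:
        return [r for r in rows if all(v is not None for v in r.values())]

class Pipeline:
    """Combines a validator and a remediator and logs the score delta."""
    def __init__(self, validator, remediator):
        self.validator, self.remediator = validator, remediator

    def run(self, rows: List[dict], report: Report) -> Tuple[List[dict], float]:
        before = self.validator.score(rows)
        remediated = self.remediator.apply(rows)
        after = self.validator.score(remediated)
        report.log(f"quality score: {before:.2f} -> {after:.2f}")
        return remediated, after

if __name__ == "__main__":
    data = [{"a": 1, "b": 2}, {"a": None, "b": 3}, {"a": 4, "b": 5}]
    report = Report()
    pipeline = Pipeline(MissingValueValidator(), DropMissingRemediator())
    clean, score = pipeline.run(data, report)
    print(score, report.entries)
```

In this sketch, the pipeline assesses the data before and after remediation and appends the score delta to the report, mirroring how the toolkit's Data Readiness Report captures lineage automatically rather than relying on a manually kept log.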
These features allow a data scientist to understand data quality issues and address them systematically within their data science pipelines. We will be holding a hands-on session on the toolkit at THINK 2021. Join us by signing up.