The hidden secret of artificial intelligence is that much of it is actually powered by humans. To be specific, the supervised learning algorithms that have gained much of the attention recently depend on humans to provide the well-labeled data used to train them. Machines have to be taught first, and since they can’t teach themselves (yet), it falls to humans to do this training. This is the secret Achilles’ heel of AI: the need for humans to teach machines the things that they are not yet able to do on their own.
Machine learning is what powers today’s AI systems. Organizations are implementing one or more of the seven patterns of AI, including computer vision, natural language processing, predictive analytics, autonomous systems, pattern and anomaly detection, goal-driven systems, and hyperpersonalization across a wide range of applications. However, for these systems to create accurate generalizations, they must be trained on data. The more advanced forms of machine learning, especially deep learning neural networks, require significant volumes of data to create models with desired levels of accuracy. It goes without saying, then, that the data needs to be clean, accurate, complete, and well-labeled so the resulting machine learning models are accurate. Garbage in, garbage out has always been the rule in computing, and it applies with particular force to machine learning data.
According to analyst firm Cognilytica, over 80% of AI project time is spent preparing and labeling data for use in machine learning projects.
(Disclosure: I’m a principal analyst at Cognilytica)
Fully one quarter of this time is spent providing the necessary labels on data so that supervised machine learning approaches will actually achieve their learning objectives. Customers have the data, but they don’t have the resources to label large data sets, nor do they have a mechanism to ensure accuracy and quality. Raw labor is easy to come by, but it’s much harder to guarantee any level of quality from a random, mostly transient labor force. Third-party managed labeling solution providers address this gap by combining a labor force to do the labeling with expertise in large-scale data labeling efforts and an infrastructure for managing labeling workloads and achieving desired quality levels.
According to a recent report from research firm Cognilytica, over 35 companies are currently engaged in providing human labor to add labels and annotations to data to power supervised learning algorithms. Some of these firms use general, crowdsourced approaches to data labeling, while others bring their own managed and trained labor pools that can address a wide range of general and domain-specific data labeling needs.
As detailed in the Cognilytica report, the tasks for data labeling and annotation depend highly on the sort of data to be labeled for machine learning purposes and the specific learning task that is needed. The primary use cases for data labeling fall into the following major categories:
- Image classification / tagging / annotation
- Speech and text tagging and labeling
- 3D Point Cloud tagging
- Sentiment Analysis
- Conversational tagging
- Relevance and personalization labeling
- Knowledge Graph development
These labeling tasks are getting increasingly complicated and domain-specific as machine learning models are developed that can handle more general use cases. For example, innovative medical technology companies are building machine learning models that can identify all manner of concerns within medical images, such as clots, fractures, tumors, and obstructions. Building these models requires first training machine learning algorithms to identify those issues within images. Training the models requires lots of data in which the specific areas of concern have been labeled. Accomplishing that labeling task requires knowing how to identify a particular issue and how to label it appropriately. This is not a task for a random, off-the-street individual. It requires some amount of domain expertise.
Consequently, labeling firms have evolved to provide more domain-specific capabilities and expanded the footprint of their offerings. As machine learning is applied to ever more specific areas, the need for this sort of domain-specific data labeling will only increase. According to the Cognilytica report, the demand for data labeling services from third parties will grow from $1.7 billion (USD) in 2019 to over $4.1 billion by 2024. This is a significant market, much larger than most might be aware of.
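To put those two endpoints in perspective, the short Python sketch below works out the compound annual growth rate they imply. The function name is illustrative only, and the calculation assumes smooth year-over-year compounding across the five years, which the report itself may not assume.

```python
def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoint figures."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above, in billions of USD, spanning 2019 to 2024.
# Because the 2024 figure is stated as "over" $4.1 billion, treat the
# result as an approximate floor rather than a precise rate.
rate = implied_annual_growth(1.7, 4.1, 2024 - 2019)
print(f"Implied annual growth: {rate:.1%}")
```

Under those assumptions the market would be growing by roughly 19% a year.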
Increasingly, machines are doing this work of data labeling as well. Data labeling providers are applying machine learning to their own labeling efforts to perform some of the work of labeling, perform quality control checks on human labor, and optimize the labeling process. These firms use machine learning inferencing to identify data types, values that don’t match the structure of a data column, and potential data quality or formatting issues, and to recommend how users could clean the data. In this way, machine learning is helping the process of improving machine learning. AI applied to AI. Quite interesting.
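As a rough illustration of what a check for values that don’t match a column’s structure might look like, here is a minimal Python sketch. It is a simple rule-based stand-in, not the machine learning inference that labeling providers actually use, and the function names and sample column are hypothetical.

```python
from collections import Counter


def infer_type(value):
    """Classify a raw string as 'empty', 'int', 'float', or 'text'."""
    text = value.strip()
    if not text:
        return "empty"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "text"


def flag_mismatches(column):
    """Return the column's dominant type and the (index, value) pairs that deviate from it."""
    types = [infer_type(value) for value in column]
    non_empty = [t for t in types if t != "empty"]
    if not non_empty:
        # Nothing to infer from: every cell is blank, so flag them all.
        return None, list(enumerate(column))
    dominant = Counter(non_empty).most_common(1)[0][0]
    flagged = [
        (index, value)
        for index, (value, t) in enumerate(zip(column, types))
        if t != dominant
    ]
    return dominant, flagged


# Hypothetical "age" column with one malformed entry and one blank cell.
ages = ["34", "27", "forty", "51", "", "19"]
dominant, flagged = flag_mismatches(ages)
print(dominant, flagged)
```

A production system would presumably learn these patterns from data rather than hard-code them, but the output has the same shape: a guess at the column type plus a list of cells a human should review.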
For the foreseeable future, the need for human-based data labeling for machine learning will not diminish. If anything, the use of machine learning continues to grow into new domains that require new knowledge to be built and learned by systems. Learning in those new domains requires well-labeled data, which in turn requires the services of the hidden army of human laborers making AI work as well as it does today.