Ambiguity surrounding Artificial Intelligence is rife. Most enterprise claims of AI are simply applications of machine learning. Although this technology encompasses supervised learning, unsupervised learning, and reinforcement learning, misconceptions about these terms, and about their use throughout the enterprise, abound.
Many of these misapprehensions stem from the names of these forms of statistical AI. For example, some believe that simply using machine learning as a feedback loop is reinforcement learning. Others think fully automated machine learning applications, devoid of human “supervision,” are examples of unsupervised learning.
Nonetheless, the true distinctions among these categories pertain to training data. With reinforcement learning there’s no training dataset prepared in advance; an agent interacts with an environment and learns from rewards and punishments. Supervised learning requires labeled training data (labeled primarily by humans) to teach models to predict desired outcomes.
Unsupervised learning involves training data without labels, in which “the system tries to find kind of a stable set of clusters in your data,” remarked Franz CEO Jans Aasman. “So, the data makes up its own categories.”
This chief value proposition of unsupervised learning—for machine learning models to explore and categorize data based on their findings as opposed to stratifications imposed by humans—is immensely useful for everything from general data discovery to targeted use cases like precision medicine.
When properly implemented, it results in highly specific analytics insight that is otherwise easy to miss.
As Aasman’s quotation implies, clustering techniques are the quintessential form of unsupervised learning. Dimensionality reduction is another facet of unsupervised learning, one that applies to supervised learning as well. Clustering enables machine learning systems to group data based on similarities. In some instances, people might not be aware of these similarities in their data, which is why clustering is an excellent data discovery technique. For example, organizations can use clustering approaches to populate a knowledge graph with various source data, or to de-duplicate records to ensure data quality. Some of the most useful clustering techniques and practices include:
- K-Means: This clustering algorithm is partially spatially based and necessitates “kind of dividing all the data in a three-dimensional space into…groups,” Aasman revealed. A potential limitation of this technique is “a human being has to decide how many clusters you want,” Aasman said. “It’s like you’re god and you already know.” Other clustering techniques, however, surmount this limitation and “let the software figure what is the most stable number of clusters that explain the variability of the data the best,” Aasman noted.
- Topological Analysis: Topological data analysis is one such modern clustering method with the advantage Aasman referenced. Its clusters are typically smaller and more numerous than those of K-Means. This technique is critical for devising comprehensive digital twins for entire production environments. It’s also useful in healthcare. According to Aasman, diabetes is usually characterized as “diabetes 1 and 2 or general diabetes.” In a specific use case, Aasman found that topological analysis of patients with type 2 diabetes and renal failure resulted in “22 clusters of these patients.”
- Additional Learning: Clustering becomes truly valuable in graph settings in which organizations can input individual clusters back into graphs to detect clusters within clusters for highly specific findings. This practice is used in certain healthcare settings to facilitate personalized medicine, partly by giving cluster identifiers to each patient with a certain medical condition, like diabetes. “This is how you can start doing precision medicine where you can put people in clusters of things and see if treatments work better for this cluster than that cluster,” Aasman commented.
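The K-Means limitation described above, that a person must choose the number of clusters, can be seen in a minimal sketch of the algorithm. The code below is an illustration written for this article, not Aasman's or Franz's implementation. The sample points, the function names, and the deterministic "farthest-first" seeding are all assumptions made so the example is reproducible. Production libraries usually seed differently.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from every
    centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm. The caller must supply k. Returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

# Hypothetical data: three loose groups in three-dimensional space.
points = [
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (10, 10, 10), (11, 10, 10), (10, 11, 10), (10, 10, 11),
    (20, 0, 10), (21, 0, 10), (20, 1, 10), (20, 0, 11),
]

for k in (1, 2, 3, 4):
    centroids, labels = kmeans(points, k)
    print(k, round(inertia(points, centroids, labels), 2))
```

Running the loop for several values of k and watching where the inertia stops dropping sharply is one common heuristic for choosing k after the fact. It does not remove the need for someone, or some outer procedure, to make that choice.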
The basic precept of dimensionality reduction applies to both supervised and unsupervised learning. Aasman outlined that concept as “can I use less variables or can I clump variables together so that I don’t need 400 variables, but I can only use three.” The primary objective of this approach is to reduce the number of inputs and still get the same quality of predictions. Principal Component Analysis (PCA) is an unsupervised form of dimensionality reduction in which data scientists analyze the correlations in their data to find a small set of variables that can stand in for the rest.
Aasman mentioned this computationally expensive process requires one to “look at every correlation between every variable in your entire dataset…and then you find what they call factors in your data.” Factors are groups of correlated input attributes that affect a machine learning model’s output. PCA supports not only clustering datasets into factors, but also decreasing the number of variables within them so “for each factor you find the most important variable that explains most of the variability,” Aasman indicated. Homing in on these variables is essential to building models with accurate predictions. Via PCA, one can deconstruct a particular factor and “go from 100 variables to maybe 20 or 10 that kind of explain everything,” Aasman confirmed.
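The process described above (correlate every variable with every other, find a factor, then pick the variable that best represents it) can be sketched in a few lines. This is a simplified illustration, not a full PCA. It computes only the first principal component of the correlation matrix, using power iteration, and the dataset and variable names are invented for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def correlation_matrix(columns):
    """Correlate every variable with every other variable."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]

def first_component(matrix, iterations=200):
    """Power iteration: returns the largest eigenvalue and its eigenvector."""
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(value * value for value in product))
        vector = [value / norm for value in product]
    eigenvalue = sum(
        vector[i] * sum(matrix[i][j] * vector[j] for j in range(n))
        for i in range(n)
    )
    return eigenvalue, vector

# Hypothetical dataset: three body measurements that move together,
# plus one largely unrelated variable.
data = {
    "height_cm": [150, 160, 165, 170, 180, 190],
    "weight_kg": [50, 58, 63, 68, 80, 90],
    "shoe_size": [36, 38, 39, 41, 43, 46],
    "commute_min": [30, 10, 45, 20, 40, 15],
}

names = list(data)
eigenvalue, loadings = first_component(correlation_matrix(data))

# The trace of a correlation matrix equals the number of variables, so
# eigenvalue / len(names) is the share of variance this factor explains.
explained_share = eigenvalue / len(names)
top_variable, top_loading = max(zip(names, loadings), key=lambda pair: abs(pair[1]))

print(f"first factor explains {explained_share:.0%} of the variance")
print(f"most representative variable: {top_variable}")
```

In this toy data the three body measurements load heavily on the first factor while the commute variable barely registers, so any one of the three could plausibly stand in for the group. Real datasets with hundreds of variables would normally be handled by a numerical library instead of hand-written power iteration.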
As the PCA example suggests, clustering is the nucleus of many unsupervised learning approaches, even when they involve dimensionality reduction. For example, organizations can utilize clustering to find similarities in customer data and target advertising to those groups intelligently. Aasman described a social media use case in which clustering algorithms identified groups of people on Facebook interested in gardening. Thus, for organizations with “anything related to gardening you’d better post in these gardening groups,” Aasman said.
Optimizing this form of unsupervised learning by assigning cluster identifiers to individual business concepts (like specific patients or customers), finding clusters within clusters, and personalizing next actions based on results requires nuanced, relationship-savvy settings. These environments are necessary to fully understand the output of analytics by performing analytics on those results.
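As a rough sketch of what assigning cluster identifiers and finding clusters within clusters can look like, the example below clusters a handful of records coarsely, then re-clusters each group at a finer scale and gives every record an identifier such as "0.1". Everything here is an assumption made for illustration: the patient names and two-feature values are invented, and the simple distance-threshold grouping is a stand-in chosen for brevity, not topological data analysis or the method used in the healthcare settings described.

```python
import math

def threshold_clusters(points, radius):
    """Group points into connected components: two points are linked when
    they lie within `radius` of each other. The number of clusters is a
    result of the data and the radius, not an input."""
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and math.dist(points[i], points[j]) <= radius:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def nested_cluster_ids(records, coarse_radius, fine_radius):
    """Return {record name: "cluster.subcluster"} identifiers."""
    names = list(records)
    coarse = threshold_clusters([records[n] for n in names], coarse_radius)
    ids = {}
    for cluster in sorted(set(coarse)):
        # Re-cluster only the members of this coarse cluster.
        members = [n for n, c in zip(names, coarse) if c == cluster]
        fine = threshold_clusters([records[n] for n in members], fine_radius)
        for name, sub in zip(members, fine):
            ids[name] = f"{cluster}.{sub}"
    return ids

# Hypothetical patients described by two already-scaled measurements.
patients = {
    "p1": (1.0, 1.0), "p2": (1.5, 1.2), "p3": (4.0, 1.0), "p4": (4.4, 1.3),
    "p5": (20.0, 20.0), "p6": (20.5, 20.2), "p7": (24.0, 20.0), "p8": (24.3, 20.4),
}

cluster_ids = nested_cluster_ids(patients, coarse_radius=5.0, fine_radius=1.0)
for name, cluster_id in cluster_ids.items():
    print(name, cluster_id)
```

Once each record carries an identifier like this, it can be stored alongside the record (for instance as a property in a graph) and used to compare outcomes between sub-clusters.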
With machine learning, it’s distinctly possible analytics results are graphs in which “A influences B, B influences C, B also influences E, E inhibits B: [there’s] all these positive, negative, neutral relationships between all these variables,” Aasman explained. “If you want to understand the correlations or what’s in your data then it’s essential to visualize your data.” Visualization mechanisms in graph settings are primed to illustrate these relationships in an intuitive way that’s difficult to duplicate in relational settings. Without visualizations, the output of analytics can be “so big, so complicated, that it’s indigestible for a human being,” Aasman cautioned.
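The kind of result Aasman describes can be represented as a small directed graph with typed edges. The sketch below stores his example relationships as (source, relation, target) triples and renders them as Graphviz DOT text, which many graph tools can draw. The choice of DOT, the function names, and the line styles are assumptions made for illustration, not the visualization tooling referenced in the article.

```python
def to_dot(edges, name="influences"):
    """Render (source, relation, target) triples as Graphviz DOT text."""
    styles = {"influences": "solid", "inhibits": "dashed"}
    lines = [f"digraph {name} {{"]
    for source, relation, target in edges:
        style = styles.get(relation, "dotted")  # unknown relations are dotted
        lines.append(
            f'  "{source}" -> "{target}" [label="{relation}", style={style}];'
        )
    lines.append("}")
    return "\n".join(lines)

def incoming(edges, node):
    """List (source, relation) pairs for every edge that points at `node`."""
    return [(source, relation) for source, relation, target in edges if target == node]

# The relationships from the quotation above.
edges = [
    ("A", "influences", "B"),
    ("B", "influences", "C"),
    ("B", "influences", "E"),
    ("E", "inhibits", "B"),
]

dot = to_dot(edges)
print(dot)
print(incoming(edges, "B"))
```

Even in this four-edge example, the feedback loop between B and E is easier to spot in a drawing than in a table of rows, which is the point of visualizing such results.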
Because misunderstandings about unsupervised learning will likely persist for the time being, it’s important to comprehend what this element of statistical AI actually is and isn’t. Unsupervised learning doesn’t simply mean there’s no human overseeing the results of machine learning. Nor is the term an antonym for the human-in-the-loop notion that, ultimately, all facets of statistical AI should involve. Unsupervised learning is the machine learning variety in which learning isn’t based on labeled examples of training data (most of which are labeled by humans), but on the actual data themselves.
That learning largely pertains to detecting similarities in data in order to form groups. These groups may initially elude human notice, making clustering mechanisms such as K-Means or topological analysis well suited for data discovery. Moreover, these measures are pivotal for compartmentalizing data into groupings for more personalized interactions. Whether deployed in precision medicine or personalized marketing, churn reduction or fraud detection, these approaches excel at separating data into core types that advance business objectives, like saving lives or converting potential customers to current ones.
About the Author
Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.