KMeans: Categorize An Unlabelled Dataset [Jupyter Notebook]
In this article, I would like to look at a field of machine learning that is very useful in everyday life, simple to implement and quick to give “immediate results.”
Many businesses, such as shops, supermarkets, shopping malls and e-commerce sites, as well as websites and web applications like Spotify, Rotten Tomatoes or Amazon, need to look for patterns among their customers. Finding these patterns makes it possible to group similar “subjects” into categories. The techniques used for this are known as clustering, a branch of machine learning that belongs to unsupervised learning.
In this article, I will analyze a dataset stored in a database owned by the management of a shopping mall. The database contains different pieces of information, such as PII, historical data about customer transactions and the total amount spent. For this article, I decided (and obtained permission) to use only two columns: the total amount spent in the last month and the percentage spent in CPG (consumer packaged goods). The percentage spent in CPG was calculated, for each customer, by dividing the amount spent in shops tagged as CPG retailers (supermarkets, bakeries, perfumeries and so on) by the total amount spent.
The tool used here to group similar objects into categories is called K-Means, and it works in a very simple way:
- The algorithm chooses k random centroids (the number of centroids, k, is specified a priori by the user).
- Each observation is assigned to its closest centroid.
- Each centroid is moved to the mean of the observations assigned to it.
- The last two steps are repeated until the assignments stop changing or a maximum number of iterations is reached.
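The steps above can be sketched in a few lines of NumPy. This is only an illustrative version, not the code used for the article: the function names `assign` and `kmeans`, the choice of picking the initial centroids among the observations, and the toy data are all assumptions made for the example.

```python
import numpy as np

def assign(X, centroids):
    """Return, for every row of X, the index of its closest centroid."""
    # Squared Euclidean distance from every point to every centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest centroid.
        labels = assign(X, centroids)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(X, centroids), centroids

# Toy data: two obvious groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

The cluster indices themselves are arbitrary: which group ends up as 0 and which as 1 depends on the random starting centroids.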
The following gif is very helpful for understanding how K-Means works:
But how is the similarity between two observations measured?
Similarity can be defined as the opposite of distance: the closer two observations are to each other, the more they have in common. That is why a very common way to measure it is to compute the squared Euclidean distance between two observations in an n-dimensional space.
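As a small illustration of that measure, here is a sketch in plain Python. The function name `squared_euclidean` and the sample points are hypothetical and chosen only for the example.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("both points must have the same number of dimensions")
    # Sum the squared difference along every dimension.
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(squared_euclidean((1, 2), (4, 6)))  # 3**2 + 4**2 = 25
```

A smaller value means the two observations are more similar, and a value of 0 means they are identical.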
Based on the previous assumptions, it is possible to define the K-Means algorithm as an optimization problem:
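One common way to write this objective is the within-cluster sum of squares. The symbols here are assumptions for this sketch: $k$ is the number of clusters, $C_j$ is the set of observations assigned to cluster $j$, and $\mu_j$ is its centroid.

```latex
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Each term is the squared Euclidean distance between an observation and the centroid of the cluster it belongs to, so the objective rewards clusters whose members sit close to their centroid.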
As the formulas suggest, K-Means only works with continuous variables, not categorical ones. There is, however, a variant for categorical data called K-Modes, described in this paper by Huang and implemented in Python here.
To dive deeper into how this tool works, I suggest reading this post on Oracle’s blog.
The Python implementation:
To implement this algorithm I will use a Python library called sklearn. After querying the database and obtaining the data, I calculated the percentage spent in CPG and plotted the data on a graph:
Since the number of clusters must be specified a priori, there are different techniques for finding the optimal number. Sometimes, as here, the clusters are already visible after simply plotting the data. In other situations, for example when assigning a film to a category with only the audio transcription available, it is necessary to apply one of these techniques. It is also worth stressing that sometimes it is the management of a company that proposes the number of categories to look for.
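One widely used technique of this kind is the so-called elbow method, which the article does not rely on but which shows the idea. The sketch below assumes synthetic data from `make_blobs` in place of the shopping-mall dataset. It fits K-Means for several values of k and records the inertia, which is the sum of squared distances of the observations to their closest centroid.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real data: 300 points around 4 centres.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances for this k.
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The inertia tends to drop sharply until k reaches the number of real groups and then flattens out. The value of k at that bend, the "elbow", is a reasonable candidate for the number of clusters.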
For this example it seems that four categories are present, so I will create four different clusters. The sklearn library makes it possible to implement this algorithm in a very simple way:
After running the algorithm, it is possible to plot the categories, assigning a different color to each of them and marking each centroid with a “+”:
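A minimal sketch of such a call is shown here. It uses synthetic data in place of the shopping-mall dataset: the four segment centres, the spreads and the variable names are made up for illustration, and the percentage is assumed to be on a 0 to 100 scale.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Hypothetical segment centres: (total amount spent, percentage spent in CPG).
centres = [(100, 20), (100, 80), (400, 20), (400, 80)]
X = np.vstack([
    np.column_stack([
        rng.normal(amount, 20, size=50),  # total amount spent last month
        rng.normal(pct, 5, size=50),      # percentage spent in CPG (0-100)
    ])
    for amount, pct in centres
])

# Ask for four clusters, as suggested by the plot of the data.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # cluster index for every customer
centroids = kmeans.cluster_centers_  # one (amount, percentage) pair per cluster

print(labels[:10])
print(centroids.round(1))
```

Because K-Means relies on distances, a feature with a much larger numeric range can dominate the result, so it may be worth standardizing the columns first when their scales differ a lot.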
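A plot of this kind could be produced with matplotlib along the following lines. This is a sketch under assumptions: the data comes from `make_blobs` rather than the real dataset, and the colour map, marker size and output file name are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two columns used in the article.
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=0)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
centers = model.cluster_centers_

fig, ax = plt.subplots()
# One colour per cluster label.
ax.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap="viridis", s=15)
# Mark each centroid with a "+".
ax.scatter(centers[:, 0], centers[:, 1], marker="+", c="red", s=200)
ax.set_xlabel("feature 1 (e.g. total amount spent)")
ax.set_ylabel("feature 2 (e.g. percentage spent in CPG)")
fig.savefig("clusters.png")
```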
The final goal of the K-Means algorithm is certainly not to make a very colorful plot. This technique is a starting point for further analysis: after labeling a dataset it is possible to run supervised learning on it, which makes it possible to predict which category new observations would fall into. This is very useful when, for example, running a marketing campaign with different targets.
Another very important point is that in this article I only used two variables, the total amount spent and the percentage spent in CPG. This makes it possible to create two-dimensional plots, but adding a third parameter, such as the customer's age, would change the problem significantly:
This shows that in clustering it is not easy to say what is correct and what is not. Every situation is a different case study, and sometimes technical decisions come not from the results of an optimization method but from the needs of the customer or the management.
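As a rough sketch of that follow-up step, the cluster labels can serve as the target for a classifier. The article does not say which model to use, so the decision tree, the synthetic data and the sample customer below are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer data.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Unsupervised step: label every observation with its cluster index.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Supervised step: learn to predict the cluster label from the features.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Predict the segment of a new, previously unseen observation.
new_customer = [[2.0, 1.0]]
predicted_segment = clf.predict(new_customer)[0]
print(accuracy, predicted_segment)
```

A fitted `KMeans` object also has its own `predict` method, which assigns a new observation to the nearest centroid. A separate classifier becomes more interesting when the segments are later refined or the model needs to be easy to explain.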
Thanks for reading this far! 🤖
Here is the Jupyter Notebook I created using similar data to the data described in this article! ENJOY IT!
Unsupervised Learning: How To Categorize An Unlabelled Dataset? was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.