NikolaNews

Unsupervised Learning: How To Categorize An Unlabelled Dataset?

February 7, 2019
in Neural Networks

Credit: BecomingHuman

KMeans: Categorize An Unlabelled Dataset [Jupyter Notebook]

In this article, I would like to look at a field of machine learning that is very useful in everyday life, simple to implement, and gives “immediate results.”

Many businesses, such as shops, supermarkets, shopping malls and e-commerce sites, as well as websites and web applications like Spotify, Rotten Tomatoes or Amazon, need to look for patterns among their customers, because it is useful to create categories that group similar “subjects”. These techniques belong to clustering, a subfield of machine learning that is part of unsupervised learning.

In this article, I analyze a dataset taken from a database owned by the management of a shopping mall. The database contains different pieces of information, such as PII, historical data about customer transactions, and the total amount spent. For this article, I decided (and obtained permission) to use only two columns: the total amount spent in the last month and the percentage spent in CPG. The percentage spent in CPG was calculated by dividing the amount each customer spent in shops tagged as CPG retailers (supermarkets, bakeries, perfumeries and so on) by that customer's total amount spent.

CPG, AKA soft drinks in a supermarket.

The tool used here to group similar objects into categories is called K-Means. It works in a very simple way:

  1. The algorithm chooses k random centroids (the number of centroids is specified a priori by the user).
  2. Each observation is assigned to its closest centroid.
  3. Each centroid is moved to the mean of the observations assigned to it.
  4. Steps 2 and 3 are repeated until a maximum number of iterations is reached or until the centroids stop moving, that is, until the assignments no longer change.
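
The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for this article: the function name `kmeans`, its parameters and the toy data are all made up here, and the initial centroids are simply drawn from the observations.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means on an array of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iter):
        # Step 2: assign every observation to its closest centroid
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned observations;
        # a centroid with no observations stays where it is.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids


# Toy data (invented for illustration): two well-separated blobs.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
```

On this toy data the first 50 rows should end up in one cluster and the last 50 in the other. Library implementations add refinements, such as smarter initialization and several restarts, that this sketch leaves out.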

The following GIF helps to understand how K-Means works:

Source: http://mcla.ug/blog/k-means-clustering.html

But how is the similarity between two observations measured?

Similarity can be defined as the opposite of distance: the closer two observations are to each other, the more they have in common. That is why a very common way to measure it is to compute the squared Euclidean distance between two observations in an n-dimensional space.
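
Written out for two observations p and q (symbols chosen here for illustration, to avoid a clash with the coordinates x and y), the squared Euclidean distance in n dimensions is:

```latex
d(\mathbf{p}, \mathbf{q})^2 = \sum_{j=1}^{n} (p_j - q_j)^2 = \lVert \mathbf{p} - \mathbf{q} \rVert_2^2
```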

Here j refers to the j-th dimension of the dataset, i.e. the coordinates (x and y in this case) that represent each observation.

Based on the previous assumptions, the K-Means algorithm can be defined as an optimization problem:
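
One common textbook way to write this objective is the within-cluster sum of squared errors. The notation below is an assumption made for illustration: m is the number of observations, k the number of clusters, p(i) the i-th observation, and w(i,j) an indicator of cluster membership. Note that j indexes centroids here, not dimensions.

```latex
\mathrm{SSE} = \sum_{i=1}^{m} \sum_{j=1}^{k} w^{(i,j)} \, \lVert \mathbf{p}^{(i)} - \boldsymbol{\mu}^{(j)} \rVert_2^2,
\qquad
w^{(i,j)} =
\begin{cases}
1 & \text{if } \mathbf{p}^{(i)} \text{ is assigned to cluster } j \\
0 & \text{otherwise}
\end{cases}
```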

Here μ(j) is the centroid of cluster j.

As the formulas suggest, K-Means works only with continuous variables, not categorical ones. There is, however, a variant for categorical data called K-Modes, described in this paper by Huang and implemented in Python here.
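
To give a feel for the idea, here is a rough sketch of the two ingredients K-Modes swaps in: a count of mismatching attributes instead of the Euclidean distance, and the per-attribute mode instead of the mean. The helper names and the records are hypothetical, and this is not the API of the Python implementation mentioned above.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent value of each attribute; plays the role of the centroid."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


# Hypothetical categorical records: (favourite shop type, payment method).
cluster = [
    ("bakery", "card"),
    ("bakery", "cash"),
    ("supermarket", "card"),
]
mode = column_modes(cluster)  # ("bakery", "card")
distance = matching_dissimilarity(("bakery", "cash"), mode)  # 1
```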

To dive deeper into how this tool works, I suggest reading this post on Oracle’s blog.

The Python implementation:

To implement this algorithm, I will use the Python library scikit-learn (sklearn). After querying the database and obtaining the data, I calculated the percentage spent in CPG and plotted the data on a graph:
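
The percentage calculation might look like the sketch below. The amounts and the variable names are invented for illustration; they are not the mall's data.

```python
import numpy as np

# Hypothetical per-customer amounts for the last month (not the article's data).
total_spent = np.array([120.0, 80.0, 300.0, 45.0])
cpg_spent = np.array([30.0, 60.0, 75.0, 45.0])

# Share of the total that went to CPG-tagged shops, as a percentage.
cpg_pct = cpg_spent / total_spent * 100

# Two-column feature matrix: one row per customer.
X = np.column_stack([total_spent, cpg_pct])
```

Each row of `X` is one customer, described by the two columns used in the rest of the article.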

The dataset contains 4647 observations.

The number of clusters must be specified a priori, and there are different techniques for finding the optimal number. Sometimes, though, the clusters are already visible after simply plotting the data. In other situations, for example when assigning films to categories using only their audio transcriptions, it is necessary to apply a technique that helps find the optimal number of clusters. It is also worth stressing that in some situations the management of a company proposes the number of categories to look for.
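
One widely used technique of this kind is the elbow method: fit K-Means for several values of k and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below uses synthetic data with three blobs as a stand-in. It was not run on the dataset in this article.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(100, 2)) for c in centers])

# Fit K-Means for several values of k and record the inertia
# (within-cluster sum of squared distances) of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the inertia should fall steeply up to k = 3 and only slightly afterwards, which points to three clusters.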

For this example, four categories appear to be present, so I will create four clusters. The sklearn library makes it possible to implement this algorithm in a very simple way:
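
A minimal sketch of that call is shown here. The data is a synthetic stand-in with four invented blob centres, since the mall's dataset is not public, and the `n_init` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (total spent, % spent in CPG) columns;
# the four blob centres are invented for illustration.
rng = np.random.default_rng(1)
blob_centers = np.array(
    [[100.0, 20.0], [100.0, 80.0], [400.0, 20.0], [400.0, 80.0]]
)
X = np.vstack(
    [rng.normal(loc=c, scale=(15.0, 5.0), size=(200, 2)) for c in blob_centers]
)

# n_clusters is the number of categories chosen a priori.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # cluster index (0..3) for each row
centroids = kmeans.cluster_centers_  # array of shape (4, 2)
```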

After running the algorithm, it is possible to plot the categories, giving each a different color and marking each centroid with a “+”:
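
A plot of this kind could be produced with matplotlib roughly as follows. The helper `plot_clusters`, the axis labels and the tiny input arrays are assumptions for illustration. In practice the labels and centroids would come from the fitted K-Means model.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_clusters(X, labels, centroids):
    """Scatter the observations coloured by cluster, with '+' at each centroid."""
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=10)
    ax.scatter(centroids[:, 0], centroids[:, 1], marker="+", s=200, c="red")
    ax.set_xlabel("Total amount spent")
    ax.set_ylabel("% spent in CPG")
    return fig, ax


# Hypothetical inputs; in practice these come from a fitted K-Means model.
X = np.array([[100.0, 20.0], [110.0, 25.0], [400.0, 80.0], [390.0, 75.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[105.0, 22.5], [395.0, 77.5]])
fig, ax = plot_clusters(X, labels, centroids)
```

Calling `plt.show()` afterwards displays the figure, and `fig.savefig(...)` writes it to a file.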

The final goal of the K-Means algorithm is certainly not to make a very colorful plot. This technique is a starting point for further analysis: after labeling a dataset, it is possible to run supervised learning on it and so predict which category new observations would fall into. This is very useful when, for example, running a marketing campaign with different targets.
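
As a rough sketch of that follow-up step, the K-Means labels can be used as the target of a classifier. The choice of a k-nearest-neighbours classifier, the synthetic data and the new customer are all assumptions made here, not something taken from the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: two blobs of customers.
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=(100.0, 20.0), scale=5.0, size=(100, 2)),
    rng.normal(loc=(400.0, 80.0), scale=5.0, size=(100, 2)),
])

# Unsupervised step: K-Means produces a label for every observation.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised step: train a classifier on the K-Means labels.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, labels)

# Predict the category of a new, unseen customer.
new_customer = np.array([[395.0, 78.0]])
predicted = clf.predict(new_customer)[0]
```

The new customer lies next to the second blob, so it should receive the same label as the last rows of `X`. A fitted `KMeans` model also has its own `predict` method, which assigns new points to the nearest centroid.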

Another very important point is that in this article I used only two variables, the total amount spent and the percentage spent in CPG, which makes it possible to create two-dimensional plots. Adding a third parameter, such as customer age, would change the problem significantly:

This shows that in clustering it is not easy to claim what is correct and what is not. Every situation is a different case study, and sometimes technical decisions come not from the results of an optimization method but from the needs of the customer or the management.

Thanks for reading this far! 🤖

Here is the Jupyter Notebook I created using similar data to the data described in this article! ENJOY IT!


Unsupervised Learning: How To Categorize An Unlabelled Dataset? was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Credit: BecomingHuman By: Roberto Sannazzaro
