The world's most valuable resource is no longer oil, but data.
That was the headline of a piece The Economist published back in 2017. It's obviously a figurative way to put it… or is it?
Among all the possible ways to take raw data and turn it into something valuable, you'll find the reason I'm writing this post: Market Basket Analysis. It's a common technique, used especially by large retailers, to find hidden patterns in customer behavior.
The easiest way I could define it would be something like:
If a customer has hamburgers and bread in their supermarket basket, the chances of ketchup also being purchased increase by (lift).
What I mean by lift is one of the outputs of the algorithm. But we'll get into more detail along with the code.
Two quick must-knows about the Apriori algorithm:
- The association rules found will always imply co-occurrence, never causality.
- The name Market Basket is a nice hint at how the algorithm expects the data to be modeled: transactional style.
Let’s get into it.
Preparing the Data
If you've already had the chance to read my other posts in this Practical Implementation Series, you know I always stress the importance of the pre-processing stage.
This one is no different. Our data has to be laid out in a very specific way, and I'll show you how, using a random dataset containing grocery purchases from a supermarket.
This is the way my data looks at first:
In this picture I'm showing you just the first columns, but it actually contains 32 in total. So, in this particular dataset, the maximum number of items that could be in a customer's basket is 32.
Notice that we have NaN values as well. That's fine; it means the respective fields were not used.
Take the first row as an example: that transaction contains only four products. Note also that a unique transaction ID for each purchase doesn't have to exist for the association rules to be created.
This data is not modeled correctly for the Apriori algorithm!
To fix it, we're gonna have to take every product that exists in the dataset and turn it into a column. After that, we'll just flag each column as 0 or 1, indicating whether or not the product was in each transaction.
I said 0 or 1, but you can go with True or False if you prefer; it's pretty much the same.
A good way to check you're doing things properly is to compare the number of rows before and after the change. You see, we'll still have information about all the transactions; only the shape will be different.
There are a lot of different ways to make these changes, but the easiest I can think of right now is using the pandas unstack() and pivot_table() functions.
I ended up with 169 columns, which means we had 169 distinct products in this dataset.
The number of rows remains the same: 9,835 transactions.
While using pivot_table() I was able to turn all values into 0 or 1. That's because the aggregation function count can only count the product itself, so each cell is at most 1, and we filled the missing values with zero (another parameter of the function).
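The reshaping code appeared as an image in the original post; here's a minimal, hedged sketch of how it could look with unstack() and pivot_table(), using a tiny made-up slice of data in place of the real 9,835-row dataset:

```python
import pandas as pd

# Hypothetical slice of the raw data: one row per transaction,
# one column per basket slot, None where the slot was not used.
raw = pd.DataFrame({
    "item_1": ["citrus fruit", "tropical fruit", "whole milk"],
    "item_2": ["semi-finished bread", "yogurt", None],
    "item_3": ["margarine", None, None],
})

# unstack() turns the wide table into a long (slot, transaction, product) list
long = raw.unstack().reset_index()
long.columns = ["slot", "transaction_id", "product"]
long = long.dropna(subset=["product"])

# pivot_table() builds the one-hot matrix: one row per transaction,
# one column per product, counting occurrences and filling gaps with 0
basket = long.pivot_table(index="transaction_id", columns="product",
                          values="slot", aggfunc="count", fill_value=0)
print(basket)
```

With the real dataset, the same steps would produce the 9,835 × 169 matrix described above.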
As always, implementing the model itself is the easiest part (since we're using Python templates). I thought it would be better to talk about the concept of lift at this point of the post.
In the next lines of code I’ll import the mlxtend package and find the association rules based on the function parameters:
The first function, apriori(), requires me to pass in the dataset we modeled before. The other two parameters have default values, but I spelled them out so we can talk about them.
Before that, let me quickly show you what the final result looks like:
min_support makes us set a minimum value for the support of each existing product.
The support of a product is the proportion of baskets it appears in among all transactions made. So let's say that out of 100 transactions (baskets), ketchup is in only 3 of them. Ketchup's support is 3/100 = 0.03.
- If a product has a low support value, the algorithm just won't find relevant associations with that product.
This kind of makes sense, right? If there's not enough information about a product, there are also not enough conclusions to be drawn from it.
The reason this parameter exists in the function is that calculating all possible rules down to low support values can be very computationally expensive. The lower min_support is, the more rules the algorithm will find.
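Since the basket matrix holds only 0s and 1s, the support of each product is simply its column mean. A quick sketch with made-up data:

```python
import pandas as pd

# Hypothetical 0/1 basket matrix: each column is a product,
# each row a transaction.
basket = pd.DataFrame({
    "ketchup": [1, 0, 0, 0],
    "bread":   [1, 1, 1, 0],
})

# Support of a product = fraction of baskets containing it = column mean
support = basket.mean()
print(support["ketchup"])  # 0.25: ketchup is in 1 of 4 baskets
```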
use_colnames just makes the function return the actual product names, which makes the analysis easier.
In the second function, you see I'm using lift as the metric. I can choose among all the metrics the algorithm supports and pass a min_threshold for it.
Without getting into how lift is calculated, the interpretation would be something like:
Having yogurt and whole milk in the basket increases the chances of curd also being there by a factor of 3.3.
In the dataset above, you see the products that seem to have the strongest associations with each other. Again: always co-occurrence, never causality!
Rules with lift values below 1 are not good. In that case, the rule would mean that the chances of having Y in the basket are actually lower when X is there.
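For reference, lift can be computed directly from supports: lift(X → Y) = support(X and Y) / (support(X) · support(Y)). The numbers below are made up purely for illustration:

```python
# Made-up supports for an X -> Y rule
support_x = 0.10    # X appears in 10% of baskets
support_y = 0.20    # Y appears in 20% of baskets
support_xy = 0.066  # X and Y appear together in 6.6% of baskets

# lift > 1: X and Y co-occur more often than if they were independent
lift = support_xy / (support_x * support_y)
print(round(lift, 2))  # 3.3
```

If X and Y were independent, support_xy would equal support_x * support_y and lift would be exactly 1; below 1, X's presence makes Y less likely.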
My objective here is to give you some insight into the practical side of Market Basket Analysis. If your goal is a more theoretical and detailed explanation, I found a really good post about it on Medium and I'll leave the link below.
Before You Go
If you haven't realized it yet, let me put the full code we developed here:
It's just crazy to see how we extracted such valuable information for a random supermarket business with 9 lines of code…
What a great moment to be alive in mankind's history, dear reader!
If you enjoyed this read, please don't forget to check out my other two posts from this Practical Implementation Series!