Collaborative filtering is an application of machine learning where we try to predict whether a user will like a particular item, such as a movie, song or product. We do so by looking at the user's previous behaviour. In its simplest form, collaborative filtering involves just 3 columns: the userID, the songID and the rating the user gave to that song. And that's exactly what we're going to use in this article.
Full Jupyter notebook.
There are 2 ways we can store the data for collaborative filtering. We can either store it in a .csv as shown above, or we can use a matrix whose rows represent users, whose columns represent items, and where the value of a particular cell is the rating the user gave that item. However, if we store it as a matrix, we end up with a very sparse one, since most users won't have watched most movies or bought most products. The matrix will also be enormous and waste a lot of storage space. Hence we will store the data as a .csv.
The way we approach the collaborative filtering problem is to assign a set of weights to each user and a set of weights to each item; the dot product of the two will be our prediction of the rating the user gives that item.
These weight matrices are called embeddings, and they end up representing features like song genre or user taste. But embeddings alone are not enough. Maybe there are users who just like most songs, or songs that are loved by most people. Hence, instead of taking only the dot product, we add a user bias and a song bias to our prediction to account for this. We then use gradient descent to update our weights until we get good enough results.
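The dot-product-plus-biases idea can be sketched in a few lines of PyTorch. The sizes here (5 users, 4 songs, embeddings of length 3) are made up for illustration:

```python
import torch

torch.manual_seed(0)
n_users, n_songs, n_factors = 5, 4, 3  # hypothetical sizes

user_emb = torch.randn(n_users, n_factors)   # one weight vector per user
song_emb = torch.randn(n_songs, n_factors)   # one weight vector per song
user_bias = torch.zeros(n_users)             # captures "likes most songs"
song_bias = torch.zeros(n_songs)             # captures "loved by most people"

def predict(user, song):
    # prediction = dot product of the two embeddings, plus both biases
    return (user_emb[user] * song_emb[song]).sum() + user_bias[user] + song_bias[song]

print(predict(0, 2))
```

Training then consists of nudging these weights so that `predict(user, song)` moves closer to the ratings we actually observed.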
Let’s see how we do this in code.
As always, we start by importing the fast.ai libraries.
We check the length of our dataset and the distribution of the ratings.
We then create a DataBunch and use 20% of our data as part of our validation set.
Note the seed parameter in the DataBunch creation step. A lot of people confuse this with np.random.seed(). We use a seed when we want our results to be reproducible. We are taking a random 20% of our data as our validation set, but we want to make sure that the next time we run the code, we get the same (random) split. Also, we cannot use np.random.seed() here, because that sets a seed in numpy, and fastai uses PyTorch behind the scenes.
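Here is a small sketch of what reproducibility means in practice. Seeding PyTorch (not numpy) before drawing the split makes the "random" 20% come out identical on every run (the function name and dataset size are my own, for illustration):

```python
import torch

n_ratings = 100  # pretend our dataset has 100 ratings

def validation_indices(seed=42):
    # Seed PyTorch (not numpy!) so the "random" 20% validation split
    # is identical every time the code runs.
    torch.manual_seed(seed)
    return torch.randperm(n_ratings)[:int(n_ratings * 0.2)]

first_run = validation_indices()
second_run = validation_indices()
# Same seed, same split -- the results are reproducible.
```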
Once we have created a DataBunch, we initialize a learner. The dot product and the biases are going to give us some predictions. These predictions can be negative, or above our maximum value (5). Eventually, after a lot of training, our model would learn to predict values in the right range. But we want to save our model some work and tell it up front the range in which we want our predictions to be. Hence we pass it the y_range parameter, which adds a sigmoid layer at the end.
We choose values slightly above and below our minimum and maximum value to keep the predictions in range.
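The sigmoid trick is simple to see in isolation. Assuming a y_range of (0, 5.5), slightly wider than the true rating range, any raw score gets squashed into it:

```python
import torch

y_min, y_max = 0.0, 5.5  # assumed y_range, a bit past the real maximum of 5

def squash(raw):
    # sigmoid maps any raw score into (0, 1); stretching that to
    # (y_min, y_max) guarantees predictions never leave the range
    return torch.sigmoid(raw) * (y_max - y_min) + y_min

scores = torch.tensor([-100.0, 0.0, 100.0])
print(squash(scores))  # all three values stay within the 0-5.5 range
```

Because the sigmoid flattens out near its extremes, choosing a range slightly wider than the real one lets the model actually reach the endpoint ratings.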
The n_factors parameter determines the size of our embeddings. There is another parameter we've passed, called weight decay, but we will find out what that is in our next article.
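Under the hood, the learner builds a model along these lines. This is a minimal PyTorch sketch of the idea (the class and attribute names are mine, not fastai's), combining the embeddings, the biases and the y_range sigmoid:

```python
import torch
import torch.nn as nn

class DotProductBias(nn.Module):
    # Sketch of the model the collab learner builds for us: two embedding
    # matrices, two bias terms, and a sigmoid scaled by y_range.
    def __init__(self, n_users, n_items, n_factors=40, y_range=(0, 5.5)):
        super().__init__()
        self.u_weight = nn.Embedding(n_users, n_factors)
        self.i_weight = nn.Embedding(n_items, n_factors)
        self.u_bias = nn.Embedding(n_users, 1)
        self.i_bias = nn.Embedding(n_items, 1)
        self.y_range = y_range

    def forward(self, users, items):
        dot = (self.u_weight(users) * self.i_weight(items)).sum(dim=1)
        raw = dot + self.u_bias(users).squeeze(1) + self.i_bias(items).squeeze(1)
        lo, hi = self.y_range
        return torch.sigmoid(raw) * (hi - lo) + lo

model = DotProductBias(n_users=100, n_items=50)
preds = model(torch.tensor([0, 1]), torch.tensor([3, 7]))
```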
Finally we find the learning rate and train our model.
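In fastai this is typically a call to learn.lr_find() followed by learn.fit_one_cycle(). What that training amounts to is a gradient-descent loop like the sketch below, shown here on a tiny synthetic dataset (all sizes and the learning rate are assumptions for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_users, n_items, n_factors = 20, 15, 5

# Fake (user, item, rating) triples standing in for the real dataset.
users = torch.randint(0, n_users, (200,))
items = torch.randint(0, n_items, (200,))
ratings = torch.rand(200) * 5

u_emb = nn.Embedding(n_users, n_factors)
i_emb = nn.Embedding(n_items, n_factors)
u_bias = nn.Embedding(n_users, 1)
i_bias = nn.Embedding(n_items, 1)

params = [*u_emb.parameters(), *i_emb.parameters(),
          *u_bias.parameters(), *i_bias.parameters()]
opt = torch.optim.Adam(params, lr=0.05, weight_decay=1e-1)  # wd, as in the learner

def mse():
    preds = (u_emb(users) * i_emb(items)).sum(1) \
            + u_bias(users).squeeze(1) + i_bias(items).squeeze(1)
    return ((preds - ratings) ** 2).mean()

loss_before = mse().item()
for _ in range(100):          # each step is one gradient-descent update
    opt.zero_grad()
    loss = mse()
    loss.backward()
    opt.step()
loss_after = mse().item()
```

After the loop, the loss on the synthetic ratings has dropped, which is all "training" means here: the weights have been updated until predictions sit closer to the observed ratings.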
The dataset is a relatively easy one, so we get pretty good results. Also, since the size of our embeddings is just a parameter, we can run a for loop and pass in different embedding sizes to check which one gives the best result.
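That search loop looks like this sketch. Here fit_and_score is a stand-in of my own that trains fresh embeddings of a given size for a few steps; a real run would train the full learner and compare validation scores instead of training loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# Tiny synthetic dataset: 80 ratings from 10 users over 8 songs.
users = torch.randint(0, 10, (80,))
items = torch.randint(0, 8, (80,))
ratings = torch.rand(80) * 5

def fit_and_score(n_factors, steps=50):
    # Train a fresh pair of embeddings of this size and report the
    # final training loss (a real run would use a validation score).
    u = nn.Embedding(10, n_factors)
    i = nn.Embedding(8, n_factors)
    opt = torch.optim.Adam([*u.parameters(), *i.parameters()], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = (((u(users) * i(items)).sum(1) - ratings) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

results = {n: fit_and_score(n) for n in (5, 10, 20, 40)}  # try several sizes
best = min(results, key=results.get)                      # keep the winner
```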
This is how we solve a collaborative filtering problem. We can also interpret the model, for example by finding the most biased songs or the least biased users. However, for that we would require another csv telling us which id corresponds to which song or user.
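Assuming we had that id-to-title mapping, pulling out the most biased songs is just a matter of sorting the learned bias values. A sketch (the bias embedding here is freshly initialized rather than trained, and the title mapping is hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Pretend this is the song-bias embedding pulled out of a trained model.
song_bias = nn.Embedding(8, 1)

# Hypothetical mapping from song id to title, loaded from a second csv.
titles = {i: f"song_{i}" for i in range(8)}

biases = song_bias.weight.detach().squeeze(1)
most_biased = torch.argsort(biases, descending=True)[:3]  # top-3 songs by bias
for idx in most_biased:
    print(titles[idx.item()], round(biases[idx].item(), 3))
```

A song with a large positive bias is one the model thinks everyone rates highly, independent of taste, which is exactly the interpretation mentioned above.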
Problems with Collaborative filtering: Cold start
When we have enough data about a user or a song, we can do a pretty good job of predicting the outcome. However, the moment we most want to recommend songs is precisely when a new user joins our platform, or when an artist releases a new song. In that case, we have no previous data to rely on, and hence recommending is difficult. Companies take a variety of approaches to solve this problem, like asking new users to select the kinds of movies they love, but there's no clear solution to the problem as yet.
As a final note, I would urge my readers to really work on their data preparation skills. When we take part in a Kaggle competition or download a dataset online, all the data cleaning and pre-processing steps are already done for us. In the real world, however, you will have to deal with raw data, often from multiple sources. How you work your way around those sources will determine whether you turn out to be a good data scientist.
So work on those skills, because modelling is just part of the process (and a tiny one at that).
If you liked this article give it at least 50 claps :p