Using unsupervised learning techniques to create features for supervised price prediction.

Clustering has many applications. Most people know it as an unsupervised learning technique. Here, we use clustering to find similarities among real estate listings and allocate similar observations into buckets, but clustering can also serve other use cases, such as customer segmentation.

Whenever a new property comes on the market, the question of how it should be priced naturally arises. One good approximation is to see how similar properties are priced. The goal here is to define what exactly makes one property similar to another. Clustering can help identify which larger category of properties a given property belongs to, and which features influence membership in that category. One can then use the average price of that category to get an indication of the price.
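As a minimal sketch of this idea, assume a small table of listings with made-up prices and an already-assigned cluster label; the per-category average price is then a simple group-by:

```python
import pandas as pd

# Hypothetical listings with an already-assigned cluster label
listings = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "price":   [310_000, 295_000, 540_000, 565_000, 552_000],
})

# Average price per cluster serves as a first pricing indication
cluster_avg = listings.groupby("cluster")["price"].mean()
print(cluster_avg)
```

A new property assigned to cluster 1 would then be benchmarked against that cluster's average.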

Another use case for clustering is to use the cluster information as an additional feature within a supervised prediction model. The reason this can be helpful is that a cluster variable provides condensed information to the model. A tree model, for example, has an easier time splitting observations based on one variable instead of splitting based on many variables individually.

To explain that in more depth, let us assume a Random Forest model, which is a special form of bagging. It is special in the sense that it does not consider all features for every tree, but chooses random features for every tree. If a tree picks the number of bathrooms and bedrooms but not the number of square meters, it has an incomplete picture. Even though the Random Forest model tries to average out this incompleteness through a majority vote of trees, each tree would have an easier time if it had one variable which already condenses much of the basic information about the property. That would lead to fewer mispredictions for properties which show an unexpected value for one of their features.
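As an illustration of this pattern, using entirely synthetic data and hypothetical feature names, the cluster label can be computed first and then appended as an extra column before fitting the supervised model:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic property features: bedrooms, bathrooms, square meters
X = np.column_stack([
    rng.integers(1, 6, 200),      # bedrooms
    rng.integers(1, 4, 200),      # bathrooms
    rng.uniform(30, 250, 200),    # square meters
])
# Synthetic price driven mostly by size, with noise
y = X[:, 2] * 3_000 + X[:, 0] * 20_000 + rng.normal(0, 10_000, 200)

# Condense the raw features into a single cluster label ...
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# ... and append it as an additional column for the supervised model
X_aug = np.column_stack([X, kmeans.labels_])
model = RandomForestRegressor(random_state=0).fit(X_aug, y)
```

The trees can now split on the condensed cluster column as well as on the raw features.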

A more elaborate post on the usefulness of using a cluster variable within a supervised prediction model is shown here.

The clustering method used in this example is K Means. The reasons for choosing it over more complicated methods are:

- Simplicity — K Means, compared to many other clustering methods, is relatively easy to understand and to implement. Given that clustering is used as a simple additional feature, the use case does not require an overly complex model.
- Adaptive approach — The algorithm is known to adapt easily to new observations. This means that newly added data can easily be classified with this algorithm.
- Guaranteed convergence — K Means is guaranteed to converge (to a local optimum), as it minimizes the total sum of squared errors (SSE) as its objective function over a number of iterations.

Before jumping into any results it is important to elaborate a bit more on how the K Means algorithm works. The first step, and at the same time one of the most difficult steps within this approach, is to set how many clusters we want our data to be split into. This is challenging — how are we supposed to know how many clusters the data needs? We will elaborate a bit later on how to make an educated guess on that number. For now, we will simply call that mysterious number of clusters *k*.

We start by taking *k* randomly selected observations from our dataset — in our case, *k* randomly chosen properties. These are our initial centroids. A centroid is a fancy name for the mean of a cluster; since each cluster starts with a single observation, that observation also represents the mean. Afterwards, we assign all remaining properties (*N-k*) to exactly one of the *k* groups. This is done by calculating the difference between the features of a property and each of the centroids. Having more than one feature requires us to use the Euclidean distance for that: the square root of the sum of squared differences between a property's feature values and the centroid's.
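A minimal sketch of this assignment step, using hypothetical feature values (bedrooms, bathrooms, square meters) for one property and three centroids:

```python
import numpy as np

# One property and three hypothetical centroids
property_features = np.array([3, 2, 95.0])
centroids = np.array([
    [2, 1, 60.0],
    [3, 2, 100.0],
    [5, 3, 220.0],
])

# Euclidean distance: square root of the summed squared feature differences
distances = np.sqrt(((centroids - property_features) ** 2).sum(axis=1))

# The property joins the cluster of the closest centroid
nearest = distances.argmin()
```

Note that features on very different scales (square meters vs. bedrooms) would normally be standardized first, so that no single feature dominates the distance.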

After we have assigned all properties to one cluster, we re-calculate the means (centroids) of each of the *k* groups. This is done by calculating the simple average of all observations for each cluster. This calculation marks the end of the first iteration.

The next iteration starts by allocating all properties to one of the *k* centroids again, using the Euclidean distance of each property to each centroid. It could now be that some observations which were assigned to a certain group switch over to a different one of the *k* groups. If that happens, the model has not yet converged and needs more iterations.


After enough iterations, no observation will change its group assignment anymore. When that happens, the algorithm is done and said to have converged. This also means that the summed within-cluster variation of all *k* groups has reached a (local) minimum.
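The full loop described above — random initial centroids, assignment by Euclidean distance, centroid update, stop when no label changes — can be sketched as a simplified NumPy implementation (an illustration of the mechanics, not a production-grade replacement for a library version):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assign every observation to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Converged: no observation changed its group assignment
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each centroid as the mean of its assigned observations
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

On two well-separated groups of points, the function recovers one label per group after a handful of iterations.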

K Means is heavily dependent on which initial values were randomly chosen at the start. In order to test the robustness of the result, it is advised to apply seeding algorithms such as k-means++, or to rerun the algorithm with different random initializations. These methods test different initial values and assess the change in results.
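In scikit-learn, for instance, both remedies are exposed as parameters: `init="k-means++"` spreads the initial centroids out, and `n_init` reruns the whole algorithm several times and keeps the run with the lowest total SSE (shown here on synthetic data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))  # synthetic stand-in for property features

# k-means++ seeding plus 10 restarts; the best run (lowest SSE) is kept
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print(km.inertia_)  # total within-cluster SSE of the best run
```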

The entirety of the explanation above can also be summarized in the pseudo code below: