Next, for each document, the term frequency (TF) is calculated. This is simply a count of how often each term appears in the document. Since each advertisement only has 5 or 6 words and each word appears only once, the term frequency never exceeds 1 in any document.
With the term frequencies for each document in hand, each term frequency is multiplied by the corresponding inverse document frequency to arrive at the final TF-IDF vector for each document.
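As a sketch of the computation described above, here is a minimal pure-Python version using two hypothetical five- and four-word advertisements (the classic IDF formula and L2 normalization are used here; scikit-learn's implementation applies a smoothed IDF, but the behavior is analogous):

```python
import math

# Two toy "advertisements" (hypothetical examples, not from the dataset)
docs = [
    ["bright", "apartment", "city", "center", "garage"],
    ["luxury", "villa", "sea", "views"],
]

vocab = sorted({term for doc in docs for term in doc})
# Document frequency: number of documents containing each term
df = {t: sum(t in doc for doc in docs) for t in vocab}
n_docs = len(docs)

tfidf = []
for doc in docs:
    # Raw term frequency times inverse document frequency, per term
    vec = {t: doc.count(t) * (math.log(n_docs / df[t]) + 1) for t in vocab}
    # L2-normalize each document vector: ads with fewer words end up
    # with larger per-term weights
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    vec = {t: v / norm for t, v in vec.items()}
    tfidf.append(vec)
```

After normalization, each word in the four-word document carries a slightly larger weight than each word in the five-word document, mirroring the effect noted below for Advertisement 2.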
Note that the scores for the words in Advertisement 2 are consistently a bit higher than the scores for the words in Advertisement 1. Because Advertisement 2 contains fewer words, each of its words is counted as relatively more important.
To better understand how TF-IDF vectorizes our Spanish real estate data set, we’ll look at the same example we used in Part 1 of analyzing subsets of “cheap” and “expensive” homes in our data. Recall from Part 1 that we defined “cheap” homes as the most inexpensive 5% of our data, those under €75,000, and the “expensive” homes as the most expensive 5% of our data, those above €1.7 million.
The TF-IDF scores for the cheapest and most expensive properties are shown below. The score shown for each word is the sum of its TF-IDF scores across all advertisements in the “cheap” or “expensive” category.
Many of these are the same words that appeared in the “cheap” and “expensive” word clouds.
The TF-IDF vectorizer has a few hyperparameters that, when adjusted, change the vectors it creates. Perhaps the most important of these are `min_df` and `max_df`. `min_df` defines the minimum number (or fraction) of documents in which a word must appear in order to be counted. Setting this value to 0.05, for example, means that words appearing in fewer than 5% of the documents are not included. In the context of real estate listings, this would likely exclude words like a particular street name or a seldom-used adjective that occurs in only one advertisement, and can help prevent overfitting. `max_df`, on the other hand, defines the maximum number (or fraction) of documents in which a word can appear. This prevents words that show up in almost every listing from being included in the feature vector; terms like “for sale” would likely be excluded by this setting.
Below is a list of some of the words excluded from our real estate dataset with `min_df=0.02` and `max_df=0.90`. This means we excluded words that appear in fewer than 2% of all property advertisements, as well as words that appear in more than 90% of them.

```python
selected_excluded_words = [
    'kennel', 'ciencias', 'mayores', 'castiilo', 'montroy', 'worthy',
    'furniture', 'ricardo', 'fend', 'españa', 'iron', 'rotas', 'sans',
    'alike', 'portals', 'dividable', 'majestically', 'ladder',
    'communicate', 'orientation', 'grass', 'visited', 'identify',
    'setting', 'café', 'specimen', 'dorm', 'unsurpassed', 'later',
    'tarred', 'oil']
```
Limiting the NLP features in this way shrank our TF-IDF feature matrix from 13,233 columns to 158, meaning 158 terms were then used to train the model. This drastically reduces the dimensionality of the NLP feature vector and cuts down on potential noise.
These 158 additional features were then fed in as additional training features to the XGBoost model. The model’s hyperparameters were also tuned using `GridSearchCV`.
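As an illustration of the tuning step, here is a minimal sketch using scikit-learn's `GridSearchCV` on synthetic data. `GradientBoostingRegressor` stands in for `xgboost.XGBRegressor` (which exposes the same scikit-learn interface), and the parameter grid is hypothetical, since the exact values searched are not listed here:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for xgboost.XGBRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the combined baseline + TF-IDF feature matrix
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Hypothetical grid; real tuning would cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_percentage_error",  # matches the MAPE metric reported
    cv=3,
)
search.fit(X, y)
best = search.best_params_
```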
The improvement in performance was striking. The best MAPE achieved on the first, and hardest-to-predict, quintile of data using the baseline features alone was a 46.74% error. Including the 158-feature TF-IDF matrix cut this error nearly in half, to 27.01%.