The dataset we are dealing with here is relatively small and well suited to experimentation. Each row describes an adult and their characteristics, and our goal is to predict whether that adult’s salary is ≥ 50k or < 50k.
This is what our data looks like:
Machine learning approach (Full notebook)
The first thing we do is check our data set for missing values.
The occupation column has a lot of missing values.
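Checking for missing values is a one-liner in pandas. A minimal sketch, assuming the data has been loaded into a DataFrame (the tiny frame below is a hypothetical stand-in for the full dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the adult dataset;
# in the notebook this would come from pd.read_csv(...).
df = pd.DataFrame({
    "age": [39.0, 50.0, np.nan, 23.0],
    "occupation": ["Tech-support", None, None, "Sales"],
    "salary": [">=50k", "<50k", "<50k", "<50k"],
})

# Count missing values per column, highest first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```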
We won’t be dropping the rows with missing values. Instead, we will replace them with the median (in the case of continuous variables) or a placeholder string (in the case of categorical variables). We will also create new indicator columns that flag the rows that originally had missing values, as shown.
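The imputation step above can be sketched as follows. This is an illustrative implementation, not the notebook's exact code; the column names and the `"missing"` placeholder string are assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [39.0, 50.0, np.nan, 23.0],                    # continuous
    "occupation": ["Tech-support", None, None, "Sales"],  # categorical
})

for col in df.columns:
    if df[col].isnull().any():
        # Indicator column flagging which rows were originally missing.
        df[col + "_was_missing"] = df[col].isnull()
        if pd.api.types.is_numeric_dtype(df[col]):
            # Continuous: fill with the column median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Categorical: fill with a placeholder string.
            df[col] = df[col].fillna("missing")
```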
Finally, we map categorical variables to integer codes. Some may argue that one-hot encoding works better; however, plain label encoding will work for this experiment.
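With pandas, the mapping to codes falls out of the `category` dtype. A small sketch (the sample values are made up):

```python
import pandas as pd

df = pd.DataFrame({"occupation": ["Tech-support", "Sales", "Sales", "missing"]})

# Convert to pandas' category dtype, then replace each string
# with its integer code.
df["occupation"] = df["occupation"].astype("category").cat.codes
print(df["occupation"].tolist())
```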
That is it for the pre-processing part. We will now split our data into training and validation sets.
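A common way to do the split is scikit-learn's `train_test_split`; the 20% validation fraction and the fixed seed below are illustrative choices, not necessarily the notebook's:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target; in the notebook these
# come from the pre-processed dataframe.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Hold out 20% for validation; the seed makes the split reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_val.shape)
```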
Finally, we initialize a Random Forest classifier, train it, tune some hyper-parameters (more on this towards the end), and make our predictions. Our model achieves a solid 86% accuracy.
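The fit-and-score loop looks roughly like this. The synthetic data and the hyper-parameter values (`n_estimators`, `min_samples_leaf`, `max_features`) are stand-ins, not the tuned settings from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-processed adult data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative hyper-parameters; in practice these are tuned.
rf = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=3, max_features="sqrt", random_state=0
)
rf.fit(X_train, y_train)
print(f"validation accuracy: {rf.score(X_val, y_val):.2f}")
```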
We now move on to our deep learning approach to see if it can do equally well.
Deep learning approach (Full notebook)
We will follow the exact same pre-processing steps that we did in the machine learning approach, and use the same train/val split. This way we can make a fair comparison between the two approaches.
Once our data bunch is created, we can train our model.
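The notebook itself builds a fastai data bunch and tabular learner. As a library-agnostic sketch of the same idea, a small feed-forward network trained on the encoded features, here is the equivalent with scikit-learn's `MLPClassifier` (synthetic data, illustrative layer sizes):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Same synthetic stand-in data and split as in the Random Forest
# sketch, so the comparison stays fair.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Neural nets are sensitive to feature scale, so standardise first.
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

# A small two-hidden-layer network; the sizes are illustrative.
net = MLPClassifier(hidden_layer_sizes=(200, 100), max_iter=300, random_state=0)
net.fit(X_train, y_train)
print(f"validation accuracy: {net.score(X_val, y_val):.2f}")
```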
An accuracy of 85%. So neural nets do perform about as well as a hyper-tuned Random Forest, and the initial skepticism they faced about not working well with tabular data can be safely put to rest.
If you liked this article give it at least 50 claps :p