The Titanic challenge hosted by Kaggle is a competition whose goal is to predict whether a given passenger survived or died, based on a set of variables describing them, such as age, sex, or passenger class on the boat.
I have been playing with the Titanic dataset for a while, and I have recently achieved an accuracy score of 1.0 on the public leaderboard. As I’m writing this post, I am ranked 113th out of 11002 participants.
You must be wondering how I managed to achieve this. So, without further ado, let's jump straight into it.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
The dataset can be downloaded from the Kaggle website, which can be found here.
You don’t need to reinvent the wheel, you need to know how to use the wheel to make your car better.
The corresponding project can be found in my github repository here.
Below are the features provided in the test dataset.
- PassengerId: an id given to each traveler on the boat
- Pclass: the passenger class. It has three possible values: 1, 2, 3 (first, second, and third class)
- Name: the name of the passenger
- SibSp: the number of siblings and spouses traveling with the passenger
- Parch: the number of parents and children traveling with the passenger
- Ticket: the ticket number
- Fare: the ticket fare
- Cabin: the cabin number
- Embarked: the port of embarkation. It has three possible values: S, C, Q (Southampton, Cherbourg, Queenstown)
Let’s start with loading all the libraries and dependencies.
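A minimal version of this setup might look like the following. The exact imports and file paths are assumptions on my part; adjust them to wherever your copy of the Kaggle files lives.

```python
# Core libraries assumed for this walkthrough: pandas/numpy for data
# handling, scikit-learn for the model and cross-validation.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

# Load the competition files (paths are assumptions; adjust to your setup).
# train = pd.read_csv("train.csv")
# test = pd.read_csv("test.csv")
```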
Let’s print the first three rows.
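As a sketch, previewing the first rows is a one-liner with `DataFrame.head`. The tiny frame below is an illustrative stand-in for the real training file, using the well-known first three Titanic records.

```python
import pandas as pd

# Illustrative stand-in for the Titanic training frame (first three records).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
             "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

print(train.head(3))
```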
Let's do some feature engineering. I created features telling whether a passenger has a cabin, computing family size, and flagging whether a passenger is traveling alone. I also did some data cleaning, such as handling null values.
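These steps could be sketched roughly as follows. The column names follow the Kaggle schema; the fill strategies (median for Age, mode for Embarked) are common choices I am assuming here, not necessarily the exact ones in the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative frame; column names follow the Kaggle schema.
df = pd.DataFrame({
    "Cabin": ["C85", np.nan, "E46"],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 2],
    "Age": [38.0, np.nan, 4.0],
    "Embarked": ["C", "S", np.nan],
})

# Flag whether a cabin number was recorded at all.
df["Has_Cabin"] = df["Cabin"].notna().astype(int)

# Family size = siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# A passenger traveling with no family is alone.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Simple null handling: median for numeric gaps, mode for Embarked.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
```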
I continued the feature engineering by writing a function to extract each person's title from their name, grouping all the uncommon titles into a single category. I then binned passengers by sex, title, port of embarkation, fare, and age.
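A title extractor for Titanic-style names can be sketched with a small regular expression. The `Rare` bucket and the set of common titles are assumptions for illustration; the binning step is hinted at in comments using pandas' `cut`/`qcut`.

```python
import re
import pandas as pd

def get_title(name):
    """Pull the honorific (e.g. 'Mr', 'Mrs') out of a Titanic-style name."""
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Uruchurtu, Don. Manuel E",
]})

df["Title"] = df["Name"].apply(get_title)

# Collapse the uncommon honorifics into a single 'Rare' bucket.
common = {"Mr", "Mrs", "Miss", "Master"}
df["Title"] = df["Title"].where(df["Title"].isin(common), "Rare")

# Continuous features can then be binned, e.g.:
# df["FareBand"] = pd.qcut(df["Fare"], 4, labels=False)
# df["AgeBand"] = pd.cut(df["Age"], 5, labels=False)
```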
Then I removed some of the columns, like PassengerId, Name, Ticket, Cabin, and SibSp, as these values are no longer useful once the engineered features exist.
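Dropping those columns is a single `DataFrame.drop` call; the tiny frame here is only a stand-in for the real one.

```python
import pandas as pd

# Minimal stand-in frame with the columns to be removed plus two keepers.
df = pd.DataFrame({
    "PassengerId": [1], "Name": ["Braund, Mr. Owen Harris"],
    "Ticket": ["A/5 21171"], "Cabin": [None], "SibSp": [1],
    "Survived": [0], "Pclass": [3],
})

# Identifiers and free-text fields carry no direct predictive signal
# once the engineered features exist, so drop them.
drop_cols = ["PassengerId", "Name", "Ticket", "Cabin", "SibSp"]
df = df.drop(columns=drop_cols)
```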
Let’s see what we have got till now.
I continued the work by plotting a correlation matrix to better visualize how the sample features relate to one another.
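A correlation step like this usually boils down to `DataFrame.corr`; the frame below is illustrative, and the heatmap rendering (typically seaborn) is noted in a comment rather than executed.

```python
import pandas as pd

# Illustrative numeric frame; real values come from the training data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 2, 3],
    "Fare": [7.25, 71.28, 13.00, 8.05],
})

# Pairwise Pearson correlations between the numeric columns; in a
# notebook this is typically rendered with seaborn, e.g.
# sns.heatmap(corr, annot=True, cmap="coolwarm").
corr = df.corr()
print(corr)
```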
So far so good. Then I created a feature named Title, mapped Sex as a binary feature, and built a table of the sex distribution grouped by title.
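A sketch of that mapping and grouping, assuming a 0/1 encoding for sex and `pd.crosstab` for the distribution table:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Title": ["Mr", "Mrs", "Miss", "Master"],
})

# Encode sex as a binary feature (assumed convention: female=1, male=0).
df["Sex"] = df["Sex"].map({"female": 1, "male": 0})

# Cross-tabulate title against the encoded sex column.
table = pd.crosstab(df["Title"], df["Sex"])
print(table)
```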
Let’s see what we have got till now.
After that, I wrote a function to compute the Gini impurity score, using the count of survivors as a fraction of the people on board.
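For a binary outcome, Gini impurity is 1 − (p² + (1 − p)²), where p is the survival fraction. A minimal version of such a function might look like this (the name and signature are my own illustration):

```python
def gini_impurity(survived, total):
    """Gini impurity for a binary split, where the survival probability
    is the fraction of passengers on board who survived."""
    if total == 0:
        return 0.0
    p = survived / total
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

# A 50/50 split is maximally impure; a pure group scores 0.
print(gini_impurity(1, 2))   # 0.5
print(gini_impurity(0, 10))  # 0.0
```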
I played with this function with arbitrary values until I was satisfied with the result. Feel free to refer to the notebook for the code.
Now for the final part: I used K-fold cross-validation with ten splits. Initially I got good scores with a random forest model, but later found that a decision tree outperformed it in accuracy, so I used a decision tree in this project.
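The evaluation setup can be sketched as below. The synthetic `X`/`y` and the hyperparameters (`max_depth`, `random_state`) are placeholders; in the real notebook they come from the engineered training frame and tuning.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in features/labels; in practice X and y come from
# the engineered Titanic training frame.
rng = np.random.RandomState(42)
X = rng.rand(100, 5)
y = rng.randint(0, 2, 100)

# Decision tree evaluated with 10-fold cross-validation.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```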
I think this competition is a good starting point for someone beginning their data science/machine learning journey. One can play around with different models like logistic regression, random forest, naive Bayes, support vector machines, etc. This competition is also a good test bed for trying out some of the more intricate techniques like XGBoost, gradient boosting, autoencoders, or an ensemble of the above.
The corresponding source code can be found here.
Happy reading, happy learning and happy coding.