In my previous post (an EDA of this Titanic dataset), I found that the Cabin column contains plenty of missing values. I therefore decided to fill them with U, which stands for “Unknown”. This can be done simply with the fillna() method.
df['Cabin'] = df['Cabin'].fillna('U')
Next, I also found that each value in that column is a letter followed by several digits (also explained in the previous post). What I want to do now is extract only that initial letter. My approach here is to use a lambda function like this:
df['Cabin'] = df['Cabin'].apply(lambda x: x[0])
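To see the two Cabin steps in isolation, here is a minimal sketch on toy data. The cabin values below are made up for illustration, and the sketch uses the .str[0] accessor as an alternative to apply() with a lambda; both keep only the first character of each string.

```python
import pandas as pd

# Toy data (hypothetical cabin values, one of them missing)
toy = pd.DataFrame({'Cabin': ['C85', None, 'E46', 'B57 B59']})

# Fill missing cabins with 'U' (Unknown), then keep only the first character
toy['Cabin'] = toy['Cabin'].fillna('U').str[0]

cabin_letters = toy['Cabin'].tolist()
print(cabin_letters)
```

Filling the missing values first matters here: taking the first character of a NaN would not give a usable category.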
Now all values of the Cabin column have been reduced to a single letter. The next step is to convert this column into one-hot format. To do that, I will use the exact same method we used for the Embarked column.
cabin_one_hot = pd.get_dummies(df['Cabin'], prefix='Cabin')
df = pd.concat([df, cabin_one_hot], axis=1)
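If you are curious what these two lines produce, the sketch below runs them on a tiny made-up data frame (the toy values are assumptions, not rows from the real dataset). It shows that get_dummies() creates one column per distinct letter, and that concat() appends those columns next to the original one.

```python
import pandas as pd

# Toy data (hypothetical): cabin letters after the extraction step
toy = pd.DataFrame({'Cabin': ['C', 'U', 'E', 'C']})

# One new column per distinct letter, named Cabin_<letter>
cabin_one_hot = pd.get_dummies(toy['Cabin'], prefix='Cabin')

# Append the one-hot columns to the right of the existing columns
toy = pd.concat([toy, cabin_one_hot], axis=1)

columns = toy.columns.tolist()
print(columns)
```

Note that the original Cabin column is still there after the concat, so it has to be dropped separately if the model should only see the one-hot columns.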
You might think at first that we don’t need to take the Name column into account at all, since it only holds a person’s name. In theory, a name should not affect a person’s chance of survival, and I do agree with that. However, if we pay closer attention to its contents, we find something interesting: the title.
These titles may be a useful feature for predicting whether a person survived. Therefore, we are going to extract them using a get_title() function that we define ourselves.
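One possible definition of get_title() is sketched below. It assumes that every name follows the "Surname, Title. Given names" pattern, and the parameter name is hypothetical. The idea is to take the part after the first comma and cut it at the first period.

```python
def get_title(name):
    # Assumption: names look like "Surname, Title. Given names"
    after_comma = name.split(',')[1]      # e.g. " Mr. Owen Harris"
    before_period = after_comma.split('.')[0]  # e.g. " Mr"
    return before_period.strip()          # remove surrounding spaces

example_title = get_title('Braund, Mr. Owen Harris')
print(example_title)
```

A name that does not contain a comma would raise an IndexError with this sketch, so it is only suitable when the whole column follows the pattern.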
Now that the function has been declared, we can apply it to the Name column and store the result in a new column, Title.
df['Title'] = df['Name'].apply(get_title)
If you want, you can also check the unique values stored in the Title column using df['Title'].unique(). The output is going to look something like this:
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
As with the Cabin column, we are going to convert the values of Title into a one-hot representation, because at this stage they are still categorical data. Below is my approach.
title_one_hot = pd.get_dummies(df['Title'], prefix='Title')
df = pd.concat([df, title_one_hot], axis=1)
Well, I guess there’s not much to say here. There are only two values in the Sex column, female and male, so this is categorical data as well. Therefore, we can simply use the pd.get_dummies() function again to convert this column into one-hot format.
sex_one_hot = pd.get_dummies(df['Sex'], prefix='Sex')
df = pd.concat([df, sex_one_hot], axis=1)
If I had to pick, I would say the Age feature engineering is the trickiest part, well, at least for me. According to my previous article on the EDA of this Titanic dataset, the ages of 177 out of 889 passengers are missing, so we need to fill them with a number. However, in this case we will not simply fill those NaNs with the median or mean of all existing ages. Instead, I want to group the passengers by Title first, compute the median age of each title group, and then use these medians to fill the missing values. Here’s the first thing to do:
age_median = df.groupby('Title')['Age'].median()
After running the code above, we obtain the median age of each Title.
the Countess 33.0
Name: Age, dtype: float64
Next, we need to create a function fill_age() which accepts a single parameter x. This x represents one row of our data frame, and the function looks up the median age that belongs to that row’s Title.
def fill_age(x):
    for index, age in zip(age_median.index, age_median.values):
        if x['Title'] == index:
            return age
Now it’s time to apply this fill_age() function. However, we need to be careful, since we only want to replace the missing ages, not every value in the Age column. Therefore, I define a lambda function inside the apply() method. The lambda calls fill_age() only when the corresponding age is missing; otherwise, it keeps the existing value. Below is how to do it:
df['Age'] = df.apply(lambda x: fill_age(x) if np.isnan(x['Age']) else x['Age'], axis=1)
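As a side note, the same idea can probably be written without a row-wise apply(). The sketch below is an alternative, not the approach used in this post: it relies on groupby().transform('median') to compute each title's median for every row, and on fillna() to replace only the missing ages. The toy data frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two titles, each with one missing age
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age':   [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age per title, repeated on every row of that title
title_median = toy.groupby('Title')['Age'].transform('median')

# Replace only the missing ages; existing ages are left untouched
toy['Age'] = toy['Age'].fillna(title_median)

filled_ages = toy['Age'].tolist()
print(filled_ages)
```

On a data frame of this size the difference does not matter, but the vectorized form avoids looping over the medians for every row.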