
Titanic Survival Dataset Part 1/2: Exploratory Data Analysis | by Muhammad Ardi | Aug, 2020

August 28, 2020

The table above shows that the values in the Survived column are either 0 or 1, where 0 means the passenger did not survive and 1 means they survived. To count how many passengers fall into each group, we can use the groupby() method like this:

survived_count = df.groupby('Survived')['Survived'].count()
survived_count

Here’s how to read it: “Group the data frame by values in Survived column, and count the number of occurrences of each group.”

In this case, since the Survived column only has 2 possible values (either 0 or 1), the code above produces two groups. If we print out the survived_count variable, we get the following output:

Survived
0 549
1 342
Name: Survived, dtype: int64

Based on the output above, we can see that 549 passengers did not survive and 342 did. To make things look better, I wanna display these numbers in the form of a graph. Here I will use the bar() function from the Matplotlib module. The function is pretty easy to understand: the two arguments we need to pass are the index labels (the x positions) and their values (the bar heights).

plt.figure(figsize=(4,5))
plt.bar(survived_count.index, survived_count.values)
plt.title('Grouped by survival')
plt.xticks([0,1],['Not survived', 'Survived'])
for i, value in enumerate(survived_count.values):
    plt.text(i, value-70, str(value), fontsize=12, color='white',
             horizontalalignment='center', verticalalignment='center')
plt.show()

And here is the output:

Number of survived and not survived passengers.
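As a cross-check on the count above, pandas also offers value_counts(), which this article uses later for the Cabin column. The sketch below assumes a tiny made-up data frame (toy_df is a hypothetical name, not the real Titanic data) and suggests that both approaches give the same numbers.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic data frame
toy_df = pd.DataFrame({'Survived': [0, 1, 1, 0, 0]})

# Same pattern as in the article: group by the column, then count each group
by_group = toy_df.groupby('Survived')['Survived'].count()

# value_counts() counts each distinct value; sort_index() orders the result by label (0, 1)
by_value_counts = toy_df['Survived'].value_counts().sort_index()

print(by_group.tolist())
print(by_value_counts.tolist())
```

By default value_counts() sorts by frequency rather than by label, which is why sort_index() is added here to line the two results up.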

Now I will do a similar thing to find out the number of survivors by gender. Notice that here I use sum() instead of count() because we are only interested in the number of survived passengers, which are represented by the number 1. So it’s kinda like adding up the 1s in each group.

survived_sex = df.groupby('Sex')['Survived'].sum()

plt.figure(figsize=(4,5))
plt.bar(survived_sex.index, survived_sex.values)
plt.title('Survived female and male')
for i, value in enumerate(survived_sex.values):
    plt.text(i, value-20, str(value), fontsize=12, color='white',
             horizontalalignment='center', verticalalignment='center')
plt.show()
Number of survived females and males.

Well, I think the graph above is pretty straightforward to understand 🙂
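To make the difference between sum() and count() more concrete, here is a small sketch on made-up data (toy_df and stats are hypothetical names, and the numbers are not taken from the Titanic dataset). It also adds mean(), which on a 0/1 column can be read as the survival rate of each group.

```python
import pandas as pd

# Hypothetical toy data: 2 females (1 survived), 3 males (1 survived)
toy_df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 0, 0, 1, 0],
})

# sum   -> number of survivors (adds up the 1s)
# count -> number of passengers in the group
# mean  -> share of survivors, i.e. sum / count
stats = toy_df.groupby('Sex')['Survived'].agg(['sum', 'count', 'mean'])
print(stats)
```

Comparing raw survivor counts between groups can be misleading when the groups differ in size, so the mean column is a useful companion to the bar chart above.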

Next, I wanna find out the distribution of ticket classes, which is stored in the Pclass column. The way to do it is pretty much the same as what I did earlier.

pclass_count = df.groupby('Pclass')['Pclass'].count()

Now there are 3 values stored in the pclass_count variable, each representing the number of passengers in that ticket class. However, instead of drawing a bar chart, here I prefer to display it in the form of a pie chart using the pie() function.

plt.figure(figsize=(7,7))
plt.title('Grouped by pclass')
plt.pie(pclass_count.values, labels=['Class 1', 'Class 2', 'Class 3'],
        autopct='%1.1f%%', textprops={'fontsize':13})
plt.show()
Ticket class distribution shown in percent.

Furthermore, we can also display the gender and embarkation distributions as pie charts using the exact same method.

Gender distribution shown in percent.
Embarkation distribution shown in percent.
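The article does not show the code for these two charts, so the following is only a sketch of how the same pattern might be reused. The helper plot_category_pie() and the toy data are hypothetical, and value_counts() is used here in place of groupby() for brevity.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_category_pie(series, title):
    """Draw a pie chart of how often each value occurs in `series`."""
    # value_counts() drops NaN by default; sort_index() fixes the slice order
    counts = series.value_counts().sort_index()
    fig, ax = plt.subplots(figsize=(7, 7))
    ax.set_title(title)
    ax.pie(counts.values, labels=list(counts.index),
           autopct='%1.1f%%', textprops={'fontsize': 13})
    return counts, fig

# Hypothetical toy data; the real columns would be df['Sex'] and df['Embarked']
toy_df = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'male'],
    'Embarked': ['S', 'C', None, 'S'],
})

sex_counts, _ = plot_category_pie(toy_df['Sex'], 'Grouped by sex')
embarked_counts, _ = plot_category_pie(toy_df['Embarked'], 'Grouped by embarkation')
```

Calling plt.show() afterwards displays both figures in an interactive session. Rows with a missing embarkation port are left out of the chart, since they are not counted.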

Another thing that I wanna find out is the age distribution. Before going further, remember that our Age column contains 177 missing values out of 891 rows in total. Therefore, we need to get rid of those NaNs first. Here’s my approach:

ages = df[df['Age'].notnull()]['Age'].values

What the code above does is retrieve all non-NaN age values and store the result in the ages NumPy array. Next, I will use the histogram() function from the NumPy module. Notice that I pass two arguments to the function: the ages array and a list of bin edges.

ages_hist = np.histogram(ages, bins=[0,10,20,30,40,50,60,70,80,90])
ages_hist

After running the code above, we should get the following output:

(array([ 62, 102, 220, 167,  89,  48,  19,   6,   1], dtype=int64),
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90]))

It’s important to know that the return value of the np.histogram() function above is a tuple with 2 elements, where the first holds the number of values in each bin and the second holds the bin edges. To make things clearer in the figure, I will also define labels in ages_hist_labels.

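# Each bin includes its left edge but not its right edge (e.g. an age of 10 falls in '10-20'),
# except the last bin, which includes both edges.
ages_hist_labels = ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90']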
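How np.histogram() treats values that sit exactly on a bin edge is easy to get wrong, so here is a minimal sketch on made-up ages (toy_ages is hypothetical). It also builds the labels from the returned edges instead of typing them by hand, which is one possible way to keep labels and bins in sync.

```python
import numpy as np

# Hypothetical ages chosen to sit on or near the bin edges
toy_ages = np.array([0.5, 10, 19.9, 20, 80, 90])
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

counts, edges = np.histogram(toy_ages, bins=bins)

# Build one label per bin from consecutive pairs of edges
labels = [f'{int(lo)}-{int(hi)}' for lo, hi in zip(edges[:-1], edges[1:])]

# 10 lands in the second bin and 20 in the third, because bins are half-open;
# 90 still lands in the last bin, because the last bin is closed on the right.
print(counts.tolist())
print(labels)
```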

And finally we can show the histogram like this:

plt.figure(figsize=(7,7))
plt.title('Age distribution')
plt.bar(ages_hist_labels, ages_hist[0])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
# Write each bin's count just above its bar
for i, count in enumerate(ages_hist[0]):
    plt.text(i, count+3, str(count), fontsize=12,
             horizontalalignment='center', verticalalignment='center')
plt.show()
Age distribution.

If we pay attention to our Cabin column, we can see that every non-NaN value starts with a capital letter followed by several digits. This can be checked using the df['Cabin'].unique()[:10] command. Here I only return the first 10 unique values for simplicity.

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
'C23 C25 C27', 'B78'], dtype=object)

I got a feeling that these initial letters might carry something important, so I decided to keep them and drop the numbers. In order to do that, we need to create a function called take_initial().

def take_initial(x):
    return x[0]

The function above is pretty straightforward. The argument x represents the string in each row, and we return only its first character. Before applying the function to all rows in the Cabin column, we need to drop all NaN values first and store the result in a cabins object like this:

cabins = df['Cabin'].dropna()

Now that the null values have been removed, we can apply the take_initial() function and directly update the contents of cabins:

cabins = cabins.apply(take_initial)

Next we will use the value_counts() method to find out the number of occurrences of each letter. I will also store the result directly in a cabins_count object.

cabins_count = cabins.value_counts()
cabins_count

After running the code above we are going to see the following output.

C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Cabin, dtype: int64
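As an alternative to the dropna() and apply() steps above, pandas string columns expose a .str accessor that can take the first character directly and leaves NaN untouched. This is only a sketch on made-up cabin values (toy_cabins is hypothetical), not the approach used in the article.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including missing ones
toy_cabins = pd.Series(['C85', np.nan, 'C123', 'E46', 'C23 C25 C27', np.nan, 'B78'])

# .str[0] takes the first character of each string; NaN rows stay NaN
initials = toy_cabins.str[0]

# value_counts() ignores NaN by default, so no explicit dropna() is needed
initial_counts = initials.value_counts()
print(initial_counts)
```

One possible advantage is that initials keeps the same index and length as the original column, so it could be assigned back to the data frame as a new feature.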

Finally, to make things look better, I will use plt.bar() again to display it in the form of a bar chart.

plt.title('Cabin distribution')
plt.bar(cabins_count.index, cabins_count.values)
plt.show()
Cabin distribution.

The Fare attribute might also play an important role in predicting whether a passenger survived. Unlike the previous figures, here I will create a boxplot instead of a bar or pie chart. Fortunately, it’s extremely simple to do, as it only takes a call to the plt.boxplot() function.

plt.figure(figsize=(13,1))
plt.title('Fare distribution')
plt.boxplot(df['Fare'], vert=False)
plt.show()
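To read a boxplot, it helps to know the numbers behind it. The sketch below computes them by hand for a few made-up fares (toy_fares is hypothetical), assuming Matplotlib's default setting, in which points lying more than 1.5 times the interquartile range beyond the box are drawn as outliers.

```python
import numpy as np

# Hypothetical fares with one very expensive ticket
toy_fares = np.array([7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33])

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = np.percentile(toy_fares, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are plotted individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy_fares[(toy_fares < lower_fence) | (toy_fares > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```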

Credit: BecomingHuman By: Muhammad Ardi
