Let’s unravel the mysteries of ‘The Unsinkable’…

Exploratory Data Analysis — A Case Study on Titanic Data set (Part-1)

Preetam B A
6 min read · Aug 13, 2020

Imagine your group of friends has decided to spend the vacation travelling to an amazing destination, and you have been given the responsibility of finding one. Interesting? However interesting it may seem, choosing a single location that everyone agrees on is still a hectic task. There are various factors to consider while choosing the location. The cost of travelling there should fit everyone’s pockets. There should be proper accommodation options. Then there are the remoteness of the location, the activities available, the best time to visit, and so on.

To learn about all these factors, you have to search the internet and gather information from various sources. After collecting all this information, you then have to compare the locations and find the right trade-off between these factors. This very activity of gathering, understanding and comparing data (often with the help of visual representations) to make better decisions is known as Exploratory Data Analysis.

In this article we’ll be exploring the data of the legendary ‘Titanic’. We will be using the ‘Titanic: Machine Learning from Disaster’ data-set on Kaggle to dive deep into it. The main objective of that data-set is to predict the survival of the passengers based on the given attributes. Here, however, we’ll be exploring the data-set to find the hidden story that went down with the Titanic. We’ll be unraveling some amazing mysteries behind the sinking of the Titanic which you might hardly have heard of. So, put on your detective hat and grab your magnifying glass, ‘coz we’ll be exploring one of history’s most interesting data-sets to date!

The basic requirements for this project are a basic understanding of the Python language along with basic plotting libraries such as Matplotlib and Seaborn. If you don’t know any of these, it’s completely alright ‘coz by the end of this article you will have a basic understanding of them. I recommend using ‘Jupyter Notebook’ or the simpler and better option, ‘Google Colab’. You can learn how to set up Google Colab by following the simple steps mentioned in this link: https://www.geeksforgeeks.org/how-to-use-google-colab/. After opening and setting up Google Colab, it’s time to bring your inner Data Scientist out :)

I have also provided the link to the Google Colab Notebook containing all this code.

To import the data from Kaggle into Google Colab, follow these steps:

  1. Go to your Kaggle account and click on Create New API Token. It will download a kaggle.json file to your machine.
  2. Go to your Google Colab project file and run the following commands:

Choose the kaggle.json file that you have downloaded

Make a directory named kaggle and copy the kaggle.json file into it.

Change the permissions of the file and download the data-set
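The original commands for these steps are not reproduced in this text, so here is a minimal sketch of the idea in plain Python. It assumes the Kaggle client looks for the token at ~/.kaggle/kaggle.json and that owner-only permissions (600) are what it expects; the function name install_kaggle_token is my own, and the token below is only a placeholder. In Colab, the file itself would typically be uploaded first with files.upload() from google.colab.

```python
import shutil
import tempfile
from pathlib import Path


def install_kaggle_token(token_file, home_dir):
    """Copy kaggle.json into <home_dir>/.kaggle and make it owner-only."""
    kaggle_dir = Path(home_dir) / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like: mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    shutil.copy(token_file, target)                # like: cp kaggle.json ~/.kaggle/
    target.chmod(0o600)                            # like: chmod 600 ~/.kaggle/kaggle.json
    return target


# Demonstration in a throwaway directory with a placeholder token file.
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "kaggle.json"
    fake_token.write_text('{"username": "someone", "key": "placeholder"}')
    fake_home = Path(tmp) / "home"
    installed = install_kaggle_token(fake_token, fake_home)
    installed_ok = installed.exists()
    installed_mode = installed.stat().st_mode & 0o777

print(installed_ok, oct(installed_mode))
```

On a real Colab session you would pass Path.home() as home_dir. The download itself is then usually done with the Kaggle command-line client, for example kaggle competitions download -c titanic, assuming the client is installed and you have joined the competition.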

Make sure your Google Drive is mounted and the appropriate folders are created. In my case, the folders Projects > datasets were created, and that is where we’ll move the downloaded data-set to avoid importing it from Kaggle repeatedly.
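Here is a small sketch of that move step, again in plain Python. The helper name stash_dataset is hypothetical, and the demo uses a temporary folder in place of the mounted Drive. In Colab, mounting is typically done with drive.mount('/content/drive') from google.colab, and the exact Drive path is an assumption you should adapt to your own setup.

```python
import shutil
import tempfile
from pathlib import Path


def stash_dataset(downloaded_file, dataset_dir):
    """Move a downloaded file into dataset_dir, creating the folders if needed."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    destination = dataset_dir / Path(downloaded_file).name
    shutil.move(str(downloaded_file), str(destination))
    return destination


# Demonstration: a temporary folder stands in for the mounted Drive.
with tempfile.TemporaryDirectory() as tmp:
    downloaded = Path(tmp) / "train.csv"
    downloaded.write_text("PassengerId,Survived\n")  # placeholder content
    drive_root = Path(tmp) / "drive"
    stored = stash_dataset(downloaded, drive_root / "Projects" / "datasets")
    stored_ok = stored.exists() and not downloaded.exists()
    stored_parts = stored.relative_to(drive_root).parts

print(stored_ok, stored_parts)
```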

Congratulations, you’ve successfully imported the data-set from Kaggle and stored it in your Google Drive. The next time we need the data-set, we can simply copy the path of the file and load it.

Now, let’s start by importing the required libraries

Before jumping into EDA, let’s first load the data-set and take a quick glance at it. Below are the libraries that we may require.

The data can be loaded into a Pandas DataFrame, and ‘data.shape’ gives the dimensions of the data, where 891 is the number of records and 12 is the number of attributes. The path to ‘train.csv’ may vary depending on your file’s location. You can copy the path to the file directly from the ‘Files’ section right below the ‘Table of Contents’ section of Colab (as of 2020).
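As a rough sketch of the loading step, the snippet below uses pd.read_csv and data.shape. So that it runs anywhere, it reads a tiny made-up stand-in for train.csv (three invented rows with the same 12 column names) from memory. With the real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Toy stand-in for train.csv: same 12 column names, three made-up rows.
TOY_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T-001,7.25,,S
2,1,1,Passenger B,female,38,1,0,T-002,71.28,C85,C
3,1,3,Passenger C,female,,0,0,T-003,7.92,,Q
"""

# With the real file: data = pd.read_csv("<your path>/train.csv")
data = pd.read_csv(io.StringIO(TOY_CSV))

# (rows, columns): (3, 12) for this toy sample, (891, 12) for the real train.csv
print(data.shape)
```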

Output: (891, 12)

Viewing an entire data-set at once can be confusing. So, let’s view a sample of the data. ‘data.head()’ gives the first 5 and ‘data.tail()’ gives the last 5 records/rows of the dataframe, in row order.
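A quick sketch of both calls on a toy dataframe of ten made-up rows (not the real data), so that the first five and last five rows are clearly different:

```python
import pandas as pd

# Toy frame with ten made-up rows so head() and tail() return different rows.
data = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

first_rows = data.head()    # first 5 rows by position
last_rows = data.tail()     # last 5 rows by position
first_three = data.head(3)  # both methods accept the number of rows to return

print(first_rows)
print(last_rows)
```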

Output: the first five rows from data.head(), followed by the last five rows from data.tail().

Now, let’s print the columns of the dataframe.
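A minimal sketch of this step, using an empty toy dataframe that carries only the 12 column names described in this article (on the real data you would call the same attribute on the loaded dataframe):

```python
import pandas as pd

# Empty toy frame that only carries the 12 column names of the data-set.
data = pd.DataFrame(columns=[
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
])

column_names = data.columns.tolist()  # plain Python list of column labels
print(column_names)
```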

Output: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

‘data.info()’ gives information about each attribute: the count of non-null (non-missing) values and its datatype. As you can see in the output, the attributes ‘Age’, ‘Cabin’ and ‘Embarked’ have some missing values. (These missing values will be handled in later modules.)
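To see how missing values show up, here is a small sketch on a toy dataframe with three made-up rows (not the real data). Besides data.info(), it uses isnull().sum(), which is my addition here as a compact way to count missing values per column.

```python
import pandas as pd

# Toy frame with deliberately missing values in Age, Cabin and Embarked.
data = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", None],
    "Fare": [7.25, 71.28, 8.05],
})

data.info()                    # prints non-null counts and dtypes per column
missing = data.isnull().sum()  # number of missing values per column
print(missing)
```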

Output:

If you have numerical data in the data-set, ‘data.describe()’ can be used to get the count, mean, standard deviation and the five-number summary, i.e. minimum, 25% (Q1), 50% (median), 75% (Q3) and maximum, of each numerical attribute.

Credits: www.statisticshowto.com
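The five-number summary is easiest to read on a small example. The sketch below runs describe() on five made-up fare values (not taken from the data-set), chosen so that the quartiles are easy to check by hand.

```python
import pandas as pd

# Five made-up fares, evenly spaced so the quartiles are easy to verify.
data = pd.DataFrame({"Fare": [10.0, 20.0, 30.0, 40.0, 50.0]})

summary = data.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary)

# Individual statistics can be read off by row label.
median_fare = summary.loc["50%", "Fare"]
q1_fare = summary.loc["25%", "Fare"]
q3_fare = summary.loc["75%", "Fare"]
```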

Output:

Understanding the data

Okay, so we’ve seen samples of the data. But what does each of the attributes denote? The descriptions of the attributes are provided on Kaggle itself, but I’ll try to explain them here to give a better gist.

There are a total of 891 instances, each consisting of 12 attributes. Here’s brief information about what the data consists of:

1. PassengerId: A unique id given to each passenger in the data-set.

2. Survived: It denotes whether the passenger survived or not.

Here,

  • 0 = Not Survived
  • 1 = Survived

3. Pclass: Represents the ticket class, which is also considered a proxy for socio-economic status (SES).

Here,

  • 1 = Upper Class
  • 2 = Middle Class
  • 3 = Lower Class

4. Name: Name of the Passenger

5. Sex: Denotes the sex/gender of the passenger, i.e. ‘male’ or ‘female’.

6. Age: Denotes the age of the passenger

Note: If the passenger is a baby (less than a year old), the age is given as a fraction, e.g. 0.33. If the age is estimated, it is in the form xx.5, e.g. 18.5.

7. SibSp: It represents the number of siblings / spouses aboard the Titanic

The data-set defines family relations in this way…

  • Sibling = brother, sister, stepbrother, stepsister
  • Spouse = husband, wife (mistresses and fiancés were ignored)

8. Parch: It represents the number of parents / children aboard the Titanic

The dataset defines family relations in this way…

  • Parent = mother, father
  • Child = daughter, son, stepdaughter, stepson
  • Some children travelled only with a nanny, therefore parch=0 for them.

9. Ticket: It represents the ticket number of the passenger

10. Fare: It represents Passenger fare.

11. Cabin: It represents the Cabin No.

12. Embarked: It represents the Port of Embarkation

Here,

  • C = Cherbourg
  • Q = Queenstown
  • S = Southampton
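To make the coded attributes easier to read while exploring, the codes listed above can be mapped to their labels. The snippet below is only a sketch on three made-up rows: the dictionary names and the extra columns IsInfant and AgeEstimated are my own, and the age checks simply apply the two conventions from the note on Age.

```python
import pandas as pd

# Three made-up passengers covering the coded columns described above.
passengers = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Age": [0.33, 18.5, 40.0],
    "Embarked": ["S", "C", "Q"],
})

SURVIVED_LABELS = {0: "Not Survived", 1: "Survived"}
PCLASS_LABELS = {1: "Upper Class", 2: "Middle Class", 3: "Lower Class"}
PORT_NAMES = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Replace each code with its human-readable label.
decoded = passengers.assign(
    Survived=passengers["Survived"].map(SURVIVED_LABELS),
    Pclass=passengers["Pclass"].map(PCLASS_LABELS),
    Embarked=passengers["Embarked"].map(PORT_NAMES),
)

# Age conventions: below 1 means a baby; xx.5 (from 1 upwards) means estimated.
decoded["IsInfant"] = passengers["Age"] < 1
decoded["AgeEstimated"] = (passengers["Age"] >= 1) & (passengers["Age"] % 1 == 0.5)

print(decoded)
```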

Okay, so now that we have understood the data, let’s move on to the relations between the attributes, find out which factors played a major role in the survival of a passenger, and also predict whether you would have survived had you been on the Titanic. Click on the link to the next story to find out!

Link to the Notebook: Click Here

Link to Part 2 of the Blog: Click Here
