Machine Learning and Data Analysis with Python, Titanic Dataset: Part 1, Visualization

Quinn Wang · Published in Analytics Vidhya · Feb 14, 2020

Every great machine learning and data science project starts with defining the problem: What data do you have to work with and what are you trying to achieve with those data?

Therefore, a common first step is to explore the data through visualization or descriptive analysis. In this part of the tutorial, I’m going to take you through how to understand the data in order to predict who survived and who did not in the Titanic disaster. I also made a video on this, so if you learn better with visual and audio guides, or would like to hear more of the reasoning behind each step, feel free to skip to the bottom of this post, where you will find a link to the video.

Now let’s get started!

Overview of the dataset

The first step is to figure out what our X and y are, and how they are correlated. X contains the features we have to work with. For example, in the Titanic dataset each datapoint describes a passenger, with features such as name, age, sex, cabin number, etc. These features are represented as a vector in X. X is basically an aggregation of these feature vectors from all passengers, which forms a matrix.

Corresponding to each feature vector in the X matrix, there is a label which we are trying to predict. We call this label y. Aggregating the labels from all vectors in X gives us the final vector y.

To put this in the context of the Titanic example, X is a matrix made up of feature vectors describing each passenger, and y is the corresponding outcome of whether or not that passenger survived. Since the outcome can only be 1 (survived) or 0 (did not survive), this is called a binary classification problem.
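As a minimal sketch (with a few made-up rows rather than the real dataset), the X/y split looks like this in Pandas:

    import pandas as pd

    # A toy version of the Titanic data: each row is one passenger
    toy = pd.DataFrame({
        "Sex":      ["male", "female", "female"],
        "Pclass":   [3, 1, 3],
        "Age":      [22.0, 38.0, 26.0],
        "Survived": [0, 1, 1],
    })

    X = toy[["Sex", "Pclass", "Age"]]  # feature matrix: one row per passenger
    y = toy["Survived"]                # label vector: 1 = survived, 0 = did not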

What we are trying to find is which features in X correspond to survival in the label. When we look at X and the feature descriptions, we have the feature Sex. Based on intuition we know this will be a very important feature, as most of the people who survived the Titanic were female passengers. We also have “Pclass”, where 1 represents a first-class passenger and 3 represents a third-class passenger. That is also a very good feature to have, because people from a higher class probably had a better chance of being rescued first. So we already have some features with good predictive power; now let’s get into the code. In part 0 of this tutorial I went through how to set up the environment we are going to use, so if you don’t have a working Jupyter Notebook environment you might want to check that out first.

First we are going to use Pandas to read the data files we downloaded from Kaggle.

Reading in the training data

pd.read_csv({your_file_directory}) is going to read in a .csv file as a Pandas dataframe. We are going to store this dataframe in a Python variable called df.
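A sketch of that step, assuming the Kaggle training file is saved as train.csv in the working directory:

    import pandas as pd

    # Read the Kaggle training file into a dataframe
    df = pd.read_csv("train.csv")
    df.head()  # peek at the first few rows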

Let’s look at the distribution of survived vs. dead:

Total survival distribution

.value_counts(), when applied to a column (the technical term is a Pandas series), is a built-in Pandas method that returns the frequency of each unique value in the series. .plot(kind='bar') will display this frequency information in a bar graph.
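In code, the plot above comes from something like:

    # Frequency of each label: 0 = did not survive, 1 = survived
    df["Survived"].value_counts().plot(kind="bar")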

Just knowing the total distribution of survivals vs. deaths is not that interesting. We want to know what attributes describe the people who survived. For example, suppose we want to look at the survival distribution of male vs. female passengers. We can define a condition (you may hear people call this a mask) like df["Sex"]=='male', which returns a boolean value for each row in the dataframe based on whether or not the Sex entry in that row meets the condition. If we use this condition to index the dataframe (pass it in the square brackets), we are left with a filtered dataframe that only contains rows satisfying the condition. Therefore, when we call .value_counts() on the column again, it shows the survival distribution of only the male or only the female passengers.
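A sketch of that filtering for both sexes (in a notebook, each plot would go in its own cell):

    # Boolean mask: True for rows where the passenger is male
    is_male = df["Sex"] == "male"

    # Survival distribution restricted to male passengers
    df[is_male]["Survived"].value_counts().plot(kind="bar")

    # Same idea for female passengers
    df[df["Sex"] == "female"]["Survived"].value_counts().plot(kind="bar")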

Male survival distribution

Compared to the total survival distribution, the ratio of survival is smaller for the male passengers.

Female survival distribution

Compared to the total survival distribution, the ratio of survival (label 1) is much bigger: the survivors number more than twice the deaths, whereas in the total distribution more people died than survived.

We can do the same thing with Pclass. Let’s first take a look at what values this Pclass variable can take:

Frequency of each value in Pclass

And the survival distribution for each class respectively is:

Survival distribution for 1st class passengers

First class passengers had a better chance of survival than death.

Survival distribution for 2nd class passengers

For second class passengers this ratio is slightly worse but they still had a fair chance to survive.

Survival distribution for 3rd class passengers

And for third class passengers this ratio is not looking too hot.
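A sketch of the code behind the class plots above (one plot or printout per notebook cell):

    # What values can Pclass take, and how common is each?
    df["Pclass"].value_counts()

    # Survival distribution for one class at a time
    for pclass in (1, 2, 3):
        counts = df[df["Pclass"] == pclass]["Survived"].value_counts()
        print(f"Pclass {pclass}:\n{counts}\n")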

We might already have started to see a trend: female passengers in a higher class tended to survive. Let’s validate that by combining the two conditions with an AND. Putting each condition in brackets with an & in between will filter for rows in the dataframe that meet both conditions.
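For first-class female passengers, a sketch of the combined mask:

    # Both conditions must hold: note the brackets around each one
    first_class_women = df[(df["Sex"] == "female") & (df["Pclass"] == 1)]
    first_class_women["Survived"].value_counts().plot(kind="bar")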

Survival distribution for 1st class female passengers

The skew towards 1 is very apparent. So statistically speaking, Rose is very much expected to survive.

On that note, let’s look at survival statistics for passengers like Jack:
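A sketch of that filter, swapping in male and third class:

    # Jack's demographic: male and third class
    third_class_men = df[(df["Sex"] == "male") & (df["Pclass"] == 3)]
    third_class_men["Survived"].value_counts().plot(kind="bar")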

And unfortunately…

Graphs are a great presentation technique, but when you are doing the analysis yourself, maybe you just want to look at numerical values. So, from a numerical presentation standpoint, let’s look at the chances of survival for each combination of Sex and Pclass using the techniques we just learned.
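The exact code from the screenshot isn’t reproduced here, but one way to build such a table is with groupby; since Survived is 0/1, its mean is the survival rate:

    # Survival rate for each (Sex, Pclass) combination
    df.groupby(["Sex", "Pclass"])["Survived"].mean()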

Chances of survival for each (Sex, Pclass) combination

Now before we go any further, I want to point you to something that’s been staring at us for a while: what are those NaN entries?

These bad boys stand for Not a Number, and are what we call missing data. We will have to deal with them because a lot of models won’t accept data with missing values. There are two ways to tackle this:

  • Drop all rows with a missing value
  • Fill in the missing entries to some reasonable value

We can go with the first approach if, say, we have 100,000 datapoints and there are only 10 rows with a missing value. However, this isn’t the case with the Titanic dataset. .dropna() will return the original dataframe with only the rows that have no missing values.
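A quick way to see how much data that would cost us:

    print(len(df))           # number of rows before dropping
    print(len(df.dropna()))  # rows left after dropping any row with a NaN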

Effect of dropping rows with NaN

After df.dropna(), most of our training data are gone, which is obviously not ideal especially with such a small training set to start with. So we want to go with the second approach.

Let’s use the Age column as an example.

Missing values in Age column
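To count those missing entries, something like:

    # Number of missing values in each column
    df.isnull().sum()

    # Or just the Age column
    df["Age"].isnull().sum()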

The naive approach is to fill all these values with 0, but that doesn’t really make sense. For a first iteration, we can fill these entries with some average values:

Filling in missing Age values as an average

df.loc[{some condition}, {a column name}] = is going to set the values in the specified column, for the rows in the dataframe that satisfy the condition, to whatever is on the right side of the equation. In this case, we set the Age column for all rows that are missing an Age entry and whose Sex is male to the average age of all male passengers with non-missing Age entries, then do the same with the female passengers.
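A sketch of that fill, computing the per-sex averages first:

    # Average age of passengers with a known age, split by sex
    male_avg = df[df["Sex"] == "male"]["Age"].mean()      # .mean() skips NaN
    female_avg = df[df["Sex"] == "female"]["Age"].mean()

    # Fill missing Age entries with the average for that sex
    df.loc[df["Age"].isnull() & (df["Sex"] == "male"), "Age"] = male_avg
    df.loc[df["Age"].isnull() & (df["Sex"] == "female"), "Age"] = female_avg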

Now there are no missing values left in the feature column Age.

In part 2 of this series I’m going to do some more preprocessing and then move on to building a baseline model. Stay tuned!

Now the link to the video as promised a couple minutes ago…

More details on what we talked about above
