Part 1: Titanic — Basic of Data Analysis

Riikka Kokko
Published in The Startup
Aug 6, 2020

My goal was to get a better understanding of how to work with tabular data, so I challenged myself and started with the Titanic project. I think this was an excellent way to learn the basics of data analysis with Python.

You can find the competition here: https://www.kaggle.com/c/titanic
I really recommend trying it yourself if you want to learn how to analyze data and build machine learning models.

I started by importing the packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Pandas is a great package for tabular data analysis. NumPy provides a high-performance multidimensional array object and tools for working with these arrays. Matplotlib helps you generate plots, histograms, power spectra, bar charts, etc., with just a few lines of code. Seaborn is built on top of Matplotlib and can be used to create attractive and informative statistical graphics.

After loading these packages I loaded the data:

df = pd.read_csv("train.csv")

Then I had a quick look at the data:

df.head()
# This shows the first 5 rows of the table.
# To show 10 rows instead of 5, use:
df.head(10)
Screenshot of the first rows
df.tail()
# This shows the last 5 rows of the table.

I recommend starting with a look at the data so that you can be sure everything is as it should be. This is how you can avoid stupid mistakes in further analysis.

df.shape
# This returns the number of rows and columns as a (rows, columns) tuple.

It is a good habit to print the shape of the data at the beginning so you can check the number of rows and columns and be sure you haven’t lost any data during the analysis.
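As a minimal sketch of that kind of sanity check, the snippet below records the shape and counts the missing values per column. The small `toy` table is made up for illustration and only stands in for the real `df` loaded from train.csv.

```python
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape

# isna().sum() counts the missing values in each column
missing_per_column = toy.isna().sum()

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
```

Running the same two checks again after each cleaning step is one way to notice if rows have disappeared unexpectedly.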

Analyze the data

Then I continued to look at the data by counting the values. This gave me a lot of information about the content of the data.

df['Pclass'].value_counts()
# Shows the number of passengers in each class
The number of persons in each class. 3rd class was the most popular.

I prefer using percentages to showcase values. It is easier to understand the values in percentages.

df['Pclass'].value_counts(normalize=True)
# Same as above, but with normalize=True each value is returned as a fraction of the total (0.55 means 55%)
55% of people were in 3rd class
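If you want actual percentages rather than fractions, one option is to multiply the normalized counts by 100 and round them. This is a small sketch on a made-up `toy` column, not the real data.

```python
import pandas as pd

# Toy stand-in for the Pclass column (values are made up for illustration).
toy = pd.DataFrame({"Pclass": [3, 3, 1, 2]})

# normalize=True returns fractions that sum to 1
shares = toy["Pclass"].value_counts(normalize=True)

# Convert the fractions to percentages with one decimal
percentages = (shares * 100).round(1)

print(percentages)
```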

I counted the values for each column separately. In the future, I want to challenge myself to write a function that prints the value counts for every column, but that was outside the scope of this project.

I also wanted to understand the values in the different columns, so I used the describe() method for that.
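One possible shape for such a function is sketched here. The name `print_value_counts` and its parameters are hypothetical, and the `toy` table is made up so the example runs on its own.

```python
import pandas as pd


def print_value_counts(frame, columns=None, normalize=True):
    """Print value counts for each column (hypothetical helper, not from the project)."""
    columns = list(frame.columns) if columns is None else columns
    results = {}
    for col in columns:
        # dropna=False also shows how many values are missing
        counts = frame[col].value_counts(normalize=normalize, dropna=False)
        results[col] = counts
        print(f"--- {col} ---")
        print(counts, end="\n\n")
    return results


# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 3],
})

results = print_value_counts(toy)
```

On the real data you would probably pass only the categorical columns, for example `columns=['Pclass', 'Sex', 'Embarked']`, because columns such as Name have a different value in almost every row.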

df['Fare'].describe()
# describe() shows basic summary statistics such as count, mean, minimum and maximum.
“Fare” column values

Here you can see, for example, that the minimum ticket price was $0.00 and the maximum was $512.33.

I built several cross-tabulations to understand which variables were most closely related to survival.

pd.crosstab(df['Survived'], df['Sex'])
# Cross-tabulates survival against sex, as counts.
Here I also recommend using percentages instead of raw counts
pd.crosstab(df['Survived'], df['Sex'], normalize=True)
# With normalize=True, each cell is a fraction of all passengers.
Same as above just in percentages

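Cross-tabulating different variables gives you information about possible associations between them, for example between sex and survival. Because normalize=True divides every cell by the total number of passengers, the table shows that 26% of all passengers were women who survived and that 52% of all passengers were men who didn’t survive.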
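To read survival rates within each sex instead of shares of all passengers, pd.crosstab also accepts normalize='columns', which makes every column sum to 1. The sketch below compares the two on a made-up `toy` table, so the numbers are not the real Titanic figures.

```python
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

# Each cell is a fraction of ALL passengers (the whole table sums to 1)
overall = pd.crosstab(toy["Survived"], toy["Sex"], normalize=True)

# Each cell is a fraction of that sex (every column sums to 1)
within_sex = pd.crosstab(toy["Survived"], toy["Sex"], normalize="columns")

print(overall)
print(within_sex)
```

In the toy table, 2 of 7 passengers are women who survived, but 2 of the 3 women survived, which is the number you usually want when comparing groups.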

Visualize the data

It is nice to have numerical values in tables, but visualized data is easier to understand, at least for me. This is why I plotted histograms and bar charts, which also taught me how to visualize data. Here are a few examples:

df.hist(column='Age')
In this histogram, you can see that passengers were mostly 20–40 years old.

I used the seaborn library for the bar charts.

sns.countplot(x='Sex', hue='Survived', data=df);
More females survived than males.

Also, I used a heatmap to see the correlation between different columns.

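# Keep only the numeric columns: recent pandas versions raise an error when corr() meets text columns such as Name
corrmat = df.select_dtypes(include='number').corr()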
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, annot=True, square=True, annot_kws={'size': 15});

The heatmap shows a strong negative correlation between Fare and Pclass: when one increases, the other decreases. This is logical because ticket prices in 1st class are higher than in 3rd class.

If we focus on the correlations between survival and the other columns, we see a positive correlation between survival and fare: passengers who paid more for their ticket were more likely to survive.
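When only the relationship with survival matters, it can be easier to pull that one column out of the correlation matrix and sort it than to read it off the heatmap. This is a rough sketch on a made-up `toy` table, so the exact values differ from the real data.

```python
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 1, 3, 2, 3],
    "Fare": [7.25, 71.28, 53.10, 8.05, 13.00, 7.90],
    "Sex": ["male", "female", "female", "male", "female", "male"],
})

# Correlation is only defined for numeric columns, so drop the text ones first
numeric = toy.select_dtypes(include="number")

# Take the Survived column of the correlation matrix, without its self-correlation
corr_with_survived = numeric.corr()["Survived"].drop("Survived").sort_values()

print(corr_with_survived)
```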

You can find the project on GitHub. Please feel free to try it yourself and comment if there is something that needs clarifying!

Thank you to the highly trained monkey (Risto Hinno) for motivating and inspiring me!
