Basic Exploratory Data Analysis of Titanic Data Using R

Iyarace Khampakdee
Published in The Startup
7 min read · Jan 29, 2021

For those of you who are getting into data analysis, especially with the R programming language, you might wonder “Where should I start?” or “What should I do?” once you have finished installing R and RStudio.

This is a very basic tutorial on how to start your first exploratory data analysis, told as the story of my own first data analysis. If you want to follow along, feel free to do so, but first you need to download the dataset from https://www.kaggle.com/c/titanic/data. I am using the training dataset (train.csv). Do not forget to put the file into your working directory.

The source code used in this tutorial is already uploaded to my public GitHub https://github.com/nayiyarace/Basic-Exploratory-Data-Analysis-in-R.

Getting to Know Your Data

Data analysis is a very big term and quite a broad thing to talk about. Depending on what interests you in the data, you will probably want to focus on the analysis that serves your idea instead of doing every kind of analysis on earth. The problem is that you have no idea what interests you just yet.

This “getting to know your data” section can be used in pretty much every scenario where you have just received your data but have a completely blank mind about what to do with it. The steps are not complicated, and they are flexible depending on the dataset itself. Some datasets might need more or fewer steps, so feel free to adjust them. So, get your dev tool ready, create your project, don’t forget to put your dataset into the working directory, and let's get to know your data.

Use read.csv(#filepath) to read the training data set. There is no need to get fancy about the classes for each column. For now, being able to load the file into the working environment is already good enough.
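As a rough sketch of this step, the snippet below reads a CSV file into a data frame. The object name titanic_train is an assumption (guessed from the titanic_train_dropedna name used later), and the three toy rows written to a temporary file are made up so the snippet runs without the download. With the real file in your working directory you would simply call read.csv("train.csv").

```r
# Toy stand-in for train.csv: made-up rows with a few of its columns,
# written to a temporary file so this sketch runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Survived,Pclass,Sex,Age,SibSp",
  "1,0,3,male,22,1",
  "2,1,1,female,38,1",
  "3,1,3,female,,0"
), csv_path)

# With the real data this would be: titanic_train <- read.csv("train.csv")
titanic_train <- read.csv(csv_path)

# The empty Age field in the last row is read in as NA.
print(titanic_train)
```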

The very first basic exploration is to look at the data yourself. Use head and tail to see what the data looks like. The head function shows you the first 6 rows of the data and the tail function shows you the last 6. This will help you spot the fields that interest you and get a sense of what kind of analysis you can do later.
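Here is a minimal sketch of head and tail. The small data frame is a made-up stand-in for the real training data (the name titanic_train and the values are assumptions), kept to a few columns so the output stays readable.

```r
# Made-up stand-in for the data read from train.csv (8 toy rows).
titanic_train <- data.frame(
  Survived = c(0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L),
  Sex      = c("male", "female", "female", "male",
               "male", "female", "male", "female"),
  Age      = c(22, 38, 26, 35, NA, 4, 54, NA),
  SibSp    = c(1L, 1L, 0L, 0L, 0L, 1L, 0L, 3L),
  stringsAsFactors = FALSE
)

# head() returns the first 6 rows by default, tail() the last 6.
first_rows <- head(titanic_train)
last_rows  <- tail(titanic_train)

print(first_rows)
print(last_rows)
```

Both functions also take a second argument if you want a different number of rows, for example head(titanic_train, 10).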

Take a look and see what catches your eye. Now is a good time to think about what might have helped a passenger survive the Titanic disaster. For me, Sex, Age, and SibSp (Number of Siblings/Spouses Aboard) look very interesting.

Use summary(#object) to run a basic summary on the dataset.
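A small sketch of what summary does on a freshly loaded data frame. The data frame below is a made-up stand-in (name and values are assumptions), not the real training data.

```r
# Made-up stand-in for the data read from train.csv.
titanic_train <- data.frame(
  Survived = c(0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L),
  Sex      = c("male", "female", "female", "male",
               "male", "female", "male", "female"),
  Age      = c(22, 38, 26, 35, NA, 4, 54, NA),
  SibSp    = c(1L, 1L, 0L, 0L, 0L, 1L, 0L, 3L),
  stringsAsFactors = FALSE
)

# Numeric columns get min, quartiles, mean and max (plus an NA count);
# character columns only report length, class and mode.
train_summary <- summary(titanic_train)
print(train_summary)
```

Notice that Survived is summarised like a measurement (with a mean and quartiles) and Sex tells you almost nothing, which is a hint that the classes are not ideal yet.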

Some summaries might not answer the right question and will probably look quite odd. This may be because a column was read in with the wrong data class.

Use sapply(#object, class) to check the class of every column.
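A minimal sketch of the class check, again on a made-up stand-in data frame (the name titanic_train and the values are assumptions).

```r
# Made-up stand-in for the data read from train.csv.
titanic_train <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, 26, 35),
  SibSp    = c(1L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# sapply() applies class() to every column and returns a named vector.
column_classes <- sapply(titanic_train, class)
print(column_classes)
```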

As you can see, “Survived” is an integer and “Sex” is a character column. To get a useful summary, these classes need to be changed.

Converting the “Survived” and “Sex” columns into factors also changes how they are summarised. After the conversion, do the same check again: run sapply to make sure that the classes have changed, and then you are clear to run the summary function.
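The conversion and the re-check could look roughly like this. I am assuming as.factor here (factor would work too), and the data frame is a made-up stand-in for the real training data.

```r
# Made-up stand-in for the data read from train.csv.
titanic_train <- data.frame(
  Survived = c(0L, 1L, 1L, 0L, 0L, 1L),
  Sex      = c("male", "female", "female", "male", "male", "female"),
  Age      = c(22, 38, 26, 35, 54, 4),
  stringsAsFactors = FALSE
)

# Turn the two categorical columns into factors.
titanic_train$Survived <- as.factor(titanic_train$Survived)
titanic_train$Sex      <- as.factor(titanic_train$Sex)

# Check the classes again, then summarise.
print(sapply(titanic_train, class))

# Factor columns are now summarised as counts per level.
print(summary(titanic_train))
```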

Now the summary looks much better. You can begin to make sense of what the table means to you. I recommend writing a couple of sentences that describe the table. If you are still new to this, you can start with just some regular observations. Here is my example.

“There is a total of 891 passengers. 342 people survived but 549 didn’t make it. The ratio of survivors to deaths is about 38:62, which means that the number of survivors is just above half of the number of deaths. There are 577 males and 314 females. The average age is about 29–30. The maximum age is 80, but the median is 28 and the third quartile is 38, which means that there are older people on board but not many compared to the rest. (It will become clearer when plotting the distribution.) The median for SibSp is 0, which means that more than half of the passengers did not have siblings or spouses on the ship.”

Now I have a rough idea of what I want to do with the dataset. I will divide the dataset into two. One part will contain only the survivors and the other will contain only those who did not survive.

Prepare Your Data

Before doing anything, there is actually one more check to be done: finding out whether there is any missing data.

is.na(#object) checks whether each value is NA and returns the result as TRUE or FALSE. You can also use sum(is.na(#object)) to count how many NA values there actually are.
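A quick sketch of the missing-data check on a made-up stand-in data frame (name and values are assumptions). The colSums line is an extra I find handy for seeing which column the NAs sit in.

```r
# Made-up stand-in for the data read from train.csv, with two missing ages.
titanic_train <- data.frame(
  Survived = c(0L, 1L, 1L, 0L, 0L, 1L),
  Sex      = c("male", "female", "female", "male", "male", "female"),
  Age      = c(22, 38, NA, 35, NA, 4),
  SibSp    = c(1L, 1L, 0L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# is.na() on a data frame gives a TRUE/FALSE matrix; sum() counts the TRUEs.
total_missing <- sum(is.na(titanic_train))

# Missing values per column.
missing_per_column <- colSums(is.na(titanic_train))

print(total_missing)
print(missing_per_column)
```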

Missing data (in this case, NA) might disturb some analyses, so I am going to exclude the rows that have missing data.

Dropping every row that has missing data leaves only the complete rows, which I save into another object called titanic_train_dropedna. This way I can keep both the original dataset and the modified dataset in the working environment.
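One way to do this is na.omit, sketched below on a made-up stand-in data frame. The use of na.omit is my assumption for illustration (complete.cases is another common route); only the object name titanic_train_dropedna comes from the text above.

```r
# Made-up stand-in for the data read from train.csv, with two missing ages.
titanic_train <- data.frame(
  Survived = c(0L, 1L, 1L, 0L, 0L, 1L),
  Sex      = c("male", "female", "female", "male", "male", "female"),
  Age      = c(22, 38, NA, 35, NA, 4),
  SibSp    = c(1L, 1L, 0L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Keep only rows without any NA; the original object stays untouched.
titanic_train_dropedna <- na.omit(titanic_train)

print(nrow(titanic_train))           # rows before dropping
print(nrow(titanic_train_dropedna))  # rows after dropping
```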

Let’s separate the survivor and nonsurvivor data from the modified dataset. Once that is done, we are ready to make some graphs.
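The split can be sketched with plain row filtering. The object names survivors and nonsurvivors are hypothetical, and the data frame is a made-up stand-in for the NA-free dataset.

```r
# Made-up stand-in for the dataset after dropping rows with NA.
titanic_train_dropedna <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 0, 1)),
  Sex      = factor(c("male", "female", "female", "male", "male", "female")),
  Age      = c(22, 38, 26, 35, 54, 4),
  SibSp    = c(1L, 1L, 0L, 0L, 0L, 1L)
)

# Comparing against "1"/"0" works whether Survived is a factor or an integer.
survivors    <- titanic_train_dropedna[titanic_train_dropedna$Survived == "1", ]
nonsurvivors <- titanic_train_dropedna[titanic_train_dropedna$Survived == "0", ]

print(nrow(survivors))
print(nrow(nonsurvivors))
```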

For graph plotting, I will not go into too much detail, such as how to change the axis names or the color of the graph. I want to focus on how to use the functions to plot the graphs and, more importantly, how to interpret what the graphs show.

Bar chart

As you can see, about twice as many females survived as males. This might be because females were given priority when help was offered.
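A bar chart of this kind can be sketched with base R's table and barplot. The data frame is a made-up stand-in, so the bar heights will not match the real dataset, and the axis labels and legend text are my own choices.

```r
# Made-up stand-in for the dataset after dropping rows with NA.
titanic_train_dropedna <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1, 0)),
  Sex      = factor(c("male", "female", "female", "male", "male", "male"))
)

# Count passengers for every Survived/Sex combination
# (rows: Survived 0 and 1, columns: female and male).
sex_by_survival <- table(titanic_train_dropedna$Survived,
                         titanic_train_dropedna$Sex)
print(sex_by_survival)

# One group of bars per sex, one bar per survival outcome.
barplot(sex_by_survival,
        beside = TRUE,
        legend.text = c("Did not survive", "Survived"),
        xlab = "Sex",
        ylab = "Number of passengers")
```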

Histogram

The survival rate of passengers younger than 10 years old is higher. This might be because they got more help from others and from their guardians or parents. Meanwhile, the death rate of passengers who were around 20 to 40 is very high. These people not only had to help themselves but also possibly tried to help other passengers.
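To compare the age distributions side by side, something like the sketch below works. The survivors and nonsurvivors objects are hypothetical names holding made-up ages, and the 10-year bins are my own choice so that both histograms share the same breaks.

```r
# Made-up ages standing in for the two halves of the dataset.
survivors    <- data.frame(Age = c(4, 8, 26, 38, 29, 63))
nonsurvivors <- data.frame(Age = c(22, 35, 28, 31, 54, 40, 19))

# Same 10-year bins for both groups so the plots are comparable.
age_breaks <- seq(0, 80, by = 10)

# Draw the two histograms next to each other.
par(mfrow = c(1, 2))
survivor_age_hist <- hist(survivors$Age, breaks = age_breaks,
                          main = "Survived", xlab = "Age")
nonsurvivor_age_hist <- hist(nonsurvivors$Age, breaks = age_breaks,
                             main = "Did not survive", xlab = "Age")
par(mfrow = c(1, 1))  # back to a single plot per page
```

hist also returns the bin counts invisibly, so survivor_age_hist$counts lets you read the exact numbers behind each bar.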

I was also interested in the idea that passengers who had family members with them might have a higher survival rate. According to the graph, the size of the family might not matter. There are quite a lot of passengers who had a family size of 4 but still did not survive the incident. But the idea of having at least one family member might still be relevant. For passengers who had one sibling or spouse, the numbers of survivors and nonsurvivors are about the same: the ratio is about 100:90. But if you look at the passengers who did not have any sibling or spouse, the ratio is quite different: about 170:300 (survived to not survived). So, not having any family member around to help seems to have affected their survival rate.
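The SibSp comparison can be sketched the same way. Again, survivors and nonsurvivors are hypothetical names with made-up values. Because SibSp is a whole number, I assume breaks placed halfway between the integers so that each value gets its own bar.

```r
# Made-up SibSp values standing in for the two halves of the dataset.
survivors    <- data.frame(SibSp = c(0, 0, 1, 1, 0, 2))
nonsurvivors <- data.frame(SibSp = c(0, 0, 0, 1, 0, 4, 0))

# Breaks at -0.5, 0.5, ..., 8.5 give one bar per SibSp value from 0 to 8.
sibsp_breaks <- seq(-0.5, 8.5, by = 1)

par(mfrow = c(1, 2))
survivor_sibsp_hist <- hist(survivors$SibSp, breaks = sibsp_breaks,
                            main = "Survived", xlab = "SibSp")
nonsurvivor_sibsp_hist <- hist(nonsurvivors$SibSp, breaks = sibsp_breaks,
                               main = "Did not survive", xlab = "SibSp")
par(mfrow = c(1, 1))  # back to a single plot per page

# Counts for SibSp = 0 in each group, the comparison discussed above.
print(survivor_sibsp_hist$counts[1])
print(nonsurvivor_sibsp_hist$counts[1])
```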

There you go. That is pretty much it for my exploratory data analysis. There is so much more you can do with the dataset. It is up to your interests and your wonderful ideas. You can do whatever you want once you get a grip on this. It is not much, but I think it is just what I wanted, which is not to be too formal or too hard on you.

Hope you enjoyed my tutorial,

Iyarace Khampakdee
