Day 27 of 100DaysofML

Charan Soneji
Published in 100DaysofMLcode
Jul 13, 2020

Kaggle problem. I decided to pick a problem, work through it, and share my reasoning along the way. Since I'm a beginner on Kaggle too, I chose a very basic problem. The link to the problem is given below:

Let’s start by importing all the libraries using the given commands.

#data analysis libraries 
import numpy as np
import pandas as pd

#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let's load our dataset into the notebook. If you are working on your own machine, download the dataset first and adjust the file paths. Since I'm using Kaggle, I can use the paths below:

#import train and test CSV files
train_data = pd.read_csv("../input/titanic/train.csv")
test_data = pd.read_csv("../input/titanic/test.csv")

Next, I'm going to print the head of the training dataset to get a first look at the data.

Train data head
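The original only shows a screenshot here, so the exact call is an assumption on my part; it was presumably `train_data.head()`. A minimal sketch on a small made-up frame (the `sample` data is hypothetical, not the real train.csv), so it runs without the Kaggle files:

```python
import pandas as pd

# Hypothetical stand-in for train.csv: a few made-up rows with some Titanic-style columns
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})

# head() returns the first 5 rows by default; pass n to change that
first_rows = sample.head()
print(first_rows)
print(sample.head(2))
```

On the real data you would call `train_data.head()` in the same way.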

Next we move to the data analysis phase, where we try to understand our data and visualize it. Let's see which columns we have.

#get a list of the features within the dataset
print(train_data.columns)
Columns of train dataset

Next, let's look at summary statistics for the columns using the syntax below. Note that include='all' makes describe() cover every column, including the non-numeric ones; without it, only numeric columns are summarized. Try it both ways and compare the output.

train_data.describe(include='all')
Describing the train dataset
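To make the effect of include='all' concrete, here is a small sketch. The `sample` frame is a made-up stand-in for the training data, so the numbers themselves mean nothing; only the shape of the output matters.

```python
import pandas as pd

# Hypothetical stand-in for train.csv with one numeric and one text column
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

numeric_only = sample.describe()             # default: numeric columns only
everything = sample.describe(include="all")  # every column, whatever its dtype

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
print(everything.index.tolist())
```

With include='all', text columns get rows such as unique, top and freq, while numeric-only rows like mean are left as NaN for them.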

Now we need to check for missing values in the dataset, which can be done with the following command.

#check for any other unusable values
print(pd.isnull(train_data).sum())
Counting the null values.

From this we can see that the Cabin column has the most missing values, so it makes sense to drop it later, when we separate out the features for our model.
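As a rough sketch of what that could look like (not necessarily how the later post does it), the snippet below computes the share of missing values per column and then drops Cabin. The `sample` frame is hypothetical and only there to keep the example runnable.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; None marks a missing value
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Survived": [0, 1, 1, 1],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_share = sample.isnull().mean()
print(missing_share)

# drop() returns a new frame; the original keeps its Cabin column
without_cabin = sample.drop(columns=["Cabin"])
print(without_cabin.columns.tolist())
```

Looking at the share rather than the raw count makes it easier to judge whether a column is worth keeping.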

Now that we have a rough idea of our dataset, let's move to the visualization stage.

Since this is the Titanic dataset, let's visualize how the survival rate differs between men and women using a bar plot in seaborn. The syntax is given below, and I explain what it means right after:

#draw a bar plot of survival by sex
sns.barplot(x="Sex", y="Survived", data=train_data)
#print percentages of females vs. males that survive
print("Percentage of females who survived:", train_data["Survived"][train_data["Sex"] == 'female'].value_counts(normalize = True)[1]*100)
print("Percentage of males who survived:", train_data["Survived"][train_data["Sex"] == 'male'].value_counts(normalize = True)[1]*100)
  • barplot: This creates a bar graph. The x and y arguments are actual column names from the dataset, and each bar shows the mean of y (here, the survival rate) for one value of x.
  • value_counts: This counts how many times each unique value occurs in the given column. With normalize=True it returns fractions instead of raw counts, so selecting [1] and multiplying by 100 gives the percentage of survivors among the females or males we filtered for.
Output of survivors on titanic
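The same percentages can also be computed in one step with a groupby. This is an alternative I'm adding for comparison, not the approach used above, and the `sample` frame is made up so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for train.csv
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Approach from the post: filter one group, then take the share of 1s
female = sample["Survived"][sample["Sex"] == "female"]
female_pct = female.value_counts(normalize=True)[1] * 100

# Alternative: Survived is 0/1, so the mean within each group is the survival rate
pct_by_sex = sample.groupby("Sex")["Survived"].mean() * 100
print(female_pct)
print(pct_by_sex)
```

Both routes give the same number for each group; the groupby version is also the quantity the bars in the plot represent.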

The graph and code above show survival on the Titanic by sex, based on the given dataset.

In case you have any queries regarding the syntax, check the given link:

In this way, a number of visuals can be created from the dataset. In the next blog, I shall cover feature extraction and the models used for training. There are many more plots you can create for visualization; I just wanted to put the idea and methodology out there.

Thanks for reading. Keep Learning.

Cheers.
