Titanic Dataset (Top 7%) EDA and Prediction
Dataset Overview 🚢
The titanic.csv file contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe attributes of that person, including whether they survived (S), their age (A), their passenger class (C), their sex (G), and the fare they paid (X).
The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, you have to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).
My Submission
Steps I used in this kernel:
- Import libraries and read files
- Basic insights of data
- Feature engineering (imputation, numerical encoding, handling outliers, binning)
- Data visualization
- Modeling (train-test split, model implementation)
- Saving CSV
1. Importing libraries and read files
In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings("ignore")
In [2]:
data = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')
ss = pd.read_csv('../input/titanic/gender_submission.csv')
- Now let’s print out some random rows from the dataset using “sample()” function.
In [3]:
data.sample(5)
Let’s have a look over the first 5 and last 5 rows of the dataset
Top 5 values of the dataset
In [5]:
data.head()
Last 5 values of the dataset
In [6]:
data.tail()
2. Basic insights of data
In this part we will go through the shape of the dataset, the columns it has, and the theoretical and statistical summaries of the dataset.
In [7]:
data.shape
Out[7]:
(891, 12)
In [8]:
data.columns
Out[8]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Theoretical information about data =>
In [9]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Statistical information about data =>
In [10]:
data.describe()
Out[10]:
3. Feature Engineering
3.1 Imputation
Removing null/missing values
We handle these null values because many machine learning algorithms either cannot process them or perform worse when they are present. Dealing with missing values before modeling is therefore one of the important steps in data wrangling.
In [11]:
data.isnull().sum()
Out[11]:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
We have a Cabin column with 687 out of 891 values as null so we will drop it.
In [12]:
data.drop('Cabin',axis=1,inplace=True)
Now we are left with Embarked, which has 2 null values, and Age, which has 177, so we will replace them with the mode and the mean respectively.
In [13]:
# mode() returns a Series, so take its first entry as the fill value
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
data['Age'] = data['Age'].fillna(data['Age'].mean())
In [14]:
data.isnull().sum()
Out[14]:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
dtype: int64
So all the null values are now handled.
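A minimal sketch of mode and mean imputation on a made-up frame (the `toy` data is hypothetical, not taken from the Titanic files). It shows that `mode()` returns a Series, so its first entry has to be selected before it can be used as a fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one categorical and one numeric column, each with a gap.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", np.nan],
    "Age": [20.0, 30.0, np.nan, 40.0],
})

# mode() returns a Series (there can be ties), so take its first entry.
embarked_mode = toy["Embarked"].mode()[0]
# mean() skips NaN values by default.
age_mean = toy["Age"].mean()

toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)
toy["Age"] = toy["Age"].fillna(age_mean)

print(toy)
print(toy.isnull().sum())
```

Assigning the result back to the column avoids relying on `inplace=True` on a column selection.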
In [15]:
data.drop('Embarked',axis=1,inplace=True)
data.drop('Name',axis=1,inplace=True)
In [16]:
data.columns
Out[16]:
Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare'],
dtype='object')
3.2 Numerical Encoding
In [17]:
temp = pd.read_csv('../input/titanic/train.csv') # all extra modifications will be done on it.
In [19]:
df_test.head()
Out[19]:
Creating a new dataframe new_data from temp with fewer columns, and preparing the test set in the same way.
In [20]:
new_data = temp.drop('Ticket',axis=1)
df_test.drop('Ticket',axis=1,inplace=True)
new_data.head()
Out[20]:
Now we will replace all null values in the “Age” column with the mode of the whole column (the most common age).
We will be using pandas function fillna() for this purpose.
In [21]:
mode_value = new_data['Age'].mode()
mode_t_value = df_test['Age'].mode()
mode_value,mode_t_value
Out[21]:
(0 24.0
dtype: float64,
0 21.0
1 24.0
dtype: float64)
In [22]:
new_data['Age'].fillna(value = 24.0,inplace = True)
df_test['Age'].fillna(value = 24.0,inplace = True)
Dropping the unwanted columns out of the final dataset
In [23]:
try:
    new_data.drop('Name',axis=1,inplace=True)
    df_test.drop('Name',axis=1,inplace=True)
except KeyError:  # raised when the column is no longer present
    print("Name Already Dropped !")
try:
    new_data.drop('Cabin',axis=1,inplace=True)
    df_test.drop('Cabin',axis=1,inplace=True)
except KeyError:  # raised when the column is no longer present
    print("Cabin Already Dropped !")
The “Sex” column is categorical but plays an important role in the final prediction, so we will convert it to a numerical column using the pandas function get_dummies().
pandas.get_dummies() is used for data manipulation. It converts categorical data into dummy or indicator variables.
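As a rough illustration of what get_dummies() produces, here is a sketch on a made-up three-row frame (the `toy` data is hypothetical). It creates one indicator column per category, so a two-category column such as Sex can be reduced to a single 0/1 column.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic files.
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})

# One indicator column per category, in sorted category order.
dummies = pd.get_dummies(toy["Sex"])
print(list(dummies.columns))

# The two indicators are redundant, so keep a single 0/1 column.
toy["Sex"] = dummies["female"].astype(int)
print(toy["Sex"].tolist())
```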
In [24]:
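# keep a single indicator column (1 = female); assigning the full two-column
# dummy frame to one column can fail on recent pandas versions
new_data['Sex'] = pd.get_dummies(new_data['Sex'])['female'].astype(int)
df_test['Sex'] = pd.get_dummies(df_test['Sex'])['female'].astype(int)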
new_data.head()
Out[24]:
3.3 Handling Outliers
In [25]:
new_data.boxplot()
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3fea14f150>
From the above boxplot, we can see that some values of the Fare column are far away from the normal range of other data values. This means we have to check for the outliers in this column.
For this I will use 2 plotting methods, the first one is boxplot of that particular column and the second one is kdeplot() (used for distribution plot).
In [26]:
fig,axes = plt.subplots(1,2,figsize=(14,5))
sns.boxplot(new_data['Fare'],ax=axes[0]).set_title('Fare Box Plot Before',fontsize=18)
sns.kdeplot(new_data['Fare'],ax=axes[1]).set_title('Fare Distribution Plot Before',fontsize=18)
plt.show()
Here it’s clear that the Fare column contains outliers and is not normally distributed, so let's handle these outliers using the interquartile range (IQR).
What is Interquartile Range IQR?
IQR measures variability by dividing a data set into quartiles. The data is sorted in ascending order and split into 4 equal parts.
Q1, Q2 and Q3, called the first, second and third quartiles, are the values that separate the 4 equal parts:
Q1 represents the 25th percentile of the data.
Q2 represents the 50th percentile of the data.
Q3 represents the 75th percentile of the data.
And if a dataset has 2n or 2n+1 data points, then
Q2 = median of the dataset.
Q1 = median of the n smallest data points.
Q3 = median of the n largest data points.
IQR is the range between the first and third quartiles, Q1 and Q3: IQR = Q3 - Q1. Data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.
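The rule can be sketched on a small made-up series (the `fares` values are hypothetical). Note that pandas' quantile() interpolates linearly by default, so its quartiles can differ slightly from the "median of each half" definition.

```python
import pandas as pd

# Hypothetical fares with one obviously extreme value.
fares = pd.Series([5, 7, 8, 10, 12, 13, 15, 100], dtype=float)

q1 = fares.quantile(0.25)  # 25th percentile
q3 = fares.quantile(0.75)  # 75th percentile
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(q1, q3, iqr)
print(lower, upper)
print(outliers.tolist())
```

Only the value 100 falls outside the fences in this toy series.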
In [27]:
# Note: the 10th and 80th percentiles are used here as the lower and upper
# cut points instead of the conventional 25th and 75th.
Q1=data['Fare'].quantile(0.10)
Q3=data['Fare'].quantile(0.80)
IQR= Q3-Q1
# values beyond the fences are capped at Q3 / Q1 rather than dropped
new_data.loc[(new_data['Fare'] > (Q3 + 1.5*IQR)),'Fare'] = Q3
new_data.loc[(new_data['Fare'] < (Q1 - 1.5*IQR)),'Fare'] = Q1
t_Q1=df_test['Fare'].quantile(0.10)
t_Q3=df_test['Fare'].quantile(0.80)
t_IQR= t_Q3-t_Q1
df_test.loc[(df_test['Fare'] > (t_Q3 + 1.5*t_IQR)),'Fare'] = t_Q3
df_test.loc[(df_test['Fare'] < (t_Q1 - 1.5*t_IQR)),'Fare'] = t_Q1
print(Q1,Q3)
print(new_data.shape)
7.55 39.6875
(891, 9)
In [28]:
fig,axes = plt.subplots(1,2,figsize=(14,5))
sns.boxplot(new_data['Fare'],ax=axes[0]).set_title('Fare Box Plot After',fontsize=18)
sns.kdeplot(new_data['Fare'],ax=axes[1]).set_title('Fare Distribution Plot After',fontsize=18)
plt.show()
Now all the outliers in the Fare column have been capped.
3.4 Binning
The main motivation for binning is to make the model more robust and reduce overfitting; however, it comes at a cost to performance, because some information is lost.
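As a rough illustration of binning with pd.cut(), the sketch below uses the same edges and labels as the age bins in this section on a made-up list of ages (the `ages` values are hypothetical). By default each bin includes its right edge, so 9 is still a "Child" and 18 is still a "Teenager".

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0.42, 9, 10, 18, 19, 60, 61, 80])

# Bins are (0, 9], (9, 18], (18, 60], (60, 100] because right=True by default.
age_range = pd.cut(
    ages,
    bins=[0, 9, 18, 60, 100],
    labels=["Child", "Teenager", "Adult", "Aged"],
)

print(age_range.tolist())
```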
In [30]:
new_data['Age'].min(),new_data['Age'].max()
Out[30]:
(0.42, 80.0)
So we have an age range of 0.42 to 80 years.
In [31]:
for i in new_data['Age'].values:
    if i<1:
        print(i)
0.83
0.92
0.75
0.75
0.67
0.42
0.83
In [32]:
# AGE_RANGE
new_data['Age_Range'] = pd.cut(new_data['Age'], bins=[0,9,18,60,100], labels=["Child","Teenager","Adult", "Aged"])
df_test['Age_Range'] = pd.cut(df_test['Age'], bins=[0,9,18,60,100], labels=["Child","Teenager","Adult", "Aged"])
new_data.head()
Out[32]:
In [33]:
new_data = new_data.join(pd.get_dummies(new_data['Age_Range']))
df_test = df_test.join(pd.get_dummies(df_test['Age_Range']))
age_range = new_data['Age_Range'].values
new_data.drop('Age_Range',axis=1,inplace=True)
df_test.drop('Age_Range',axis=1,inplace=True)
new_data.head()
After creating bins for Age column we can now move to Embarked column
In [34]:
embarked = pd.get_dummies(new_data['Embarked'])
new_data = new_data.join(embarked)
new_data.drop('Embarked',axis=1,inplace=True)
T_embarked = pd.get_dummies(df_test['Embarked'])
df_test = df_test.join(T_embarked)
df_test.drop('Embarked',axis=1,inplace=True)
new_data.head()
In [35]:
new_data.drop('Parch',axis=1,inplace=True)
new_data.drop('SibSp',axis=1,inplace=True)
df_test.drop('Parch',axis=1,inplace=True)
df_test.drop('SibSp',axis=1,inplace=True)
Performing the same task over the Pclass column
In [36]:
Pclass = pd.get_dummies(new_data['Pclass'])
Pclass.columns=['UpperClass', 'MiddleClass','LowerClass']
new_data = new_data.join(Pclass)
new_data.drop('Pclass',axis=1,inplace=True)
Pclass = pd.get_dummies(df_test['Pclass'])
Pclass.columns=['UpperClass', 'MiddleClass','LowerClass']
df_test = df_test.join(Pclass)
df_test.drop('Pclass',axis=1,inplace=True)
new_data.head()
Out[36]:
In [37]:
df_test.head()
Out[37]:
4. DATA VISUALIZATION
In [38]:
temp.head() # unprocessed copy of train, used for visualization
In [39]:
new_data.corr()
Out[39]:
In [40]:
fig,axes = plt.subplots(1,2,figsize=(20,7))
plt.suptitle('Original v/s Featured Data', fontsize=18)
# temp still has text columns, so restrict the correlation to numeric ones
sns.heatmap(temp.corr(numeric_only=True),ax=axes[0]).set_title('Original Data')
sns.heatmap(new_data.corr(),ax=axes[1]).set_title('Featured Data')
plt.show()
- From the heatmaps above it’s clear that we now have many more features to consider for our prediction model.
In [41]:
sns.pairplot(data)
plt.show()
In [42]:
fig,axes = plt.subplots(1,3,figsize=(15,5))
sns.distplot(new_data['Fare'],ax=axes[0])
sns.distplot(data['Age'],ax=axes[1])
sns.distplot(data['Pclass'],ax=axes[2])
plt.show()
In [43]:
fig,axes = plt.subplots(2,2,figsize=(16,10))
# x and y are passed by keyword, which recent seaborn versions require
sns.scatterplot(x=new_data['Age'],y=new_data['Fare'],ax=axes[0,0])
sns.scatterplot(x=new_data['Age'],y=new_data['Survived'],ax=axes[0,1])
sns.barplot(x=new_data['Sex'],y=new_data['Age'],ax=axes[1,0])
sns.barplot(x=new_data['Sex'],y=new_data['Survived'],ax=axes[1,1])
plt.show()
In [44]:
new_data.head()
In [45]:
fig, ax = plt.subplots(1,2,figsize=(16,5))
sns.countplot(x=age_range,ax=ax[0]).set_title('Count of Age Group',fontsize=16)
sns.barplot(x=age_range,y=new_data['Survived'],ax=ax[1]).set_title('Survived v/s Age Group',fontsize=16)
Out[45]:
Text(0.5, 1.0, 'Survived v/s Age Group')
In [46]:
fig, ax = plt.subplots(1,2,figsize=(16,5))
sns.countplot(x=temp['Embarked'],ax=ax[0]).set_title('Count of Embarked',fontsize=16)
sns.barplot(x=temp['Embarked'],y=new_data['Survived'],ax=ax[1]).set_title('Survived v/s Embarked',fontsize=16)
Out[46]:
Text(0.5, 1.0, 'Survived v/s Embarked')
In [47]:
fig, ax = plt.subplots(1,2,figsize=(16,5))
sns.countplot(x=temp['Pclass'],ax=ax[0]).set_title('Count of Passenger Class',fontsize=16)
sns.barplot(x=temp['Pclass'],y=new_data['Survived'],ax=ax[1]).set_title('Survived v/s Passenger Class',fontsize=16)
plt.show()
5. MODELING
5.1 TRAIN-TEST SPLIT
In [48]:
new_data.isnull().sum()
Out[48]:
PassengerId 0
Survived 0
Sex 0
Age 0
Fare 0
Child 0
Teenager 0
Adult 0
Aged 0
C 0
Q 0
S 0
UpperClass 0
MiddleClass 0
LowerClass 0
dtype: int64
In [49]:
# new_data is used for training and df_test for the final prediction, so alias them as train and test.
train = new_data
test = df_test
train.head()
Out[49]:
In [50]:
test.head()
Out[50]:
In [51]:
X = train.drop('Survived',axis=1)
y = train['Survived']
X = X.iloc[:,1:]
X.head(3)
Out[51]:
In [52]:
y.head(5)
Out[52]:
0 0
1 1
2 1
3 1
4 0
Name: Survived, dtype: int64
In [53]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.10,random_state=0)
X_train.shape,y_train.shape,X_test.shape,y_test.shape
Out[53]:
((801, 13), (801,), (90, 13), (90,))
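The 801/90 split of 891 rows can be reproduced on placeholder arrays. This is only a sketch: `X_toy` and `y_toy` are made-up stand-ins for the real features and labels, and only the row counts and the effect of random_state are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the training set.
X_toy = np.arange(891 * 2).reshape(891, 2)
y_toy = np.arange(891) % 2

# test_size=0.10 reserves 10% of the rows (rounded up) for testing.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.10, random_state=0
)
# Repeating the call with the same random_state gives the same partition.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.10, random_state=0
)

print(Xa_train.shape, Xa_test.shape)
print(np.array_equal(Xa_test, Xb_test))
```

With only 90 test rows, a single misclassified passenger changes the test accuracy by about 1.1 percentage points, which is worth keeping in mind when comparing the models below.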
5.2 Model Implementation
In [54]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Support-Vector Machine
In [55]:
linear_svc = LinearSVC()
linear_svc.fit(X_train, y_train)
svc_pred = linear_svc.predict(X_test)
linear_svc.score(X_train, y_train)
acc_linear_svc = round(linear_svc.score(X_test, y_test) * 100, 2)
print(round(acc_linear_svc,2,), "%")
print(confusion_matrix(y_test,svc_pred))
80.0 %
[[47 4]
[14 25]]
In [56]:
print(classification_report(y_test,svc_pred))
              precision    recall  f1-score   support

           0       0.77      0.92      0.84        51
           1       0.86      0.64      0.74        39

    accuracy                           0.80        90
   macro avg       0.82      0.78      0.79        90
weighted avg       0.81      0.80      0.79        90
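The figures in this report follow directly from the LinearSVC confusion matrix printed above. A short worked calculation for the "survived" class (1), using the scikit-learn convention that rows are true classes and columns are predicted classes:

```python
# Confusion matrix reported for LinearSVC: [[47, 4], [14, 25]]
tn, fp = 47, 4    # true class 0: predicted 0, predicted 1
fn, tp = 14, 25   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of those predicted to survive, how many did
recall_1 = tp / (tp + fn)      # of those who survived, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(total, round(accuracy, 2), round(precision_1, 2),
      round(recall_1, 2), round(f1_1, 2))
```

The low recall for class 1 means the model misses a sizeable share of actual survivors even though overall accuracy is 80%.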
Random-Forest
Hyperparameters I used:
- max_features: the maximum number of features considered when splitting a node.
- n_estimators (int, default=100): the number of trees in the forest.
- criterion {“gini”, “entropy”}: the function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
- max_depth: the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- min_samples_split: the minimum number of samples required to split an internal node. (Changed in version 0.18: added float values for fractions.)
- min_samples_leaf: the minimum number of samples required to be at a leaf node. This may have the effect of smoothing the model, especially in regression.
max_features {“auto”, “sqrt”, “log2”}, int or float, default=“auto”
The number of features to consider when looking for the best split:
- If “auto”, then max_features=sqrt(n_features).
- If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features=n_features.
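To make the max_features options concrete, here is a sketch of how many of the 13 features each setting would consider per split. The helper `n_split_features` is hypothetical, and the rounding (truncating to an integer, with a minimum of 1) is an assumption about how the library resolves these values.

```python
import math

def n_split_features(max_features, n_features):
    """Hypothetical helper: number of features considered at each split."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    # A float is interpreted as a fraction of the available features.
    return max(1, int(max_features * n_features))

# 13 features, matching the shape of X_train in this notebook.
for setting in ("sqrt", "log2", None, 0.70):
    print(setting, n_split_features(setting, 13))
```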
In [57]:
# max_features='sqrt' matches the former 'auto' setting for classifiers,
# which recent scikit-learn releases no longer accept
random_forest1 = RandomForestClassifier(criterion='entropy', n_estimators=110,min_samples_split=40,min_samples_leaf=3, max_depth=5, max_features='sqrt',oob_score=True,random_state=1)
random_forest1.fit(X_train, y_train)
rf_pred1 = random_forest1.predict(X_test)
random_forest1.score(X_train, y_train)
acc_rf1 = round(random_forest1.score(X_test, y_test) * 100, 2)
print(round(acc_rf1,2,), "%")
print(confusion_matrix(y_test,rf_pred1))
81.11 %
[[49 2]
[15 24]]
In [58]:
print(random_forest1.oob_score_)
print(classification_report(y_test,rf_pred1))
0.8227215980024969
              precision    recall  f1-score   support

           0       0.77      0.96      0.85        51
           1       0.92      0.62      0.74        39

    accuracy                           0.81        90
   macro avg       0.84      0.79      0.80        90
weighted avg       0.83      0.81      0.80        90
Logistic Regression
In [59]:
LogReg = LogisticRegressionCV(cv=5)
LogReg.fit(X_train,y_train)
LR_pred = LogReg.predict(X_test)
LogReg.score(X_train, y_train)
acc_LR = round(LogReg.score(X_test, y_test) * 100, 2)
print(round(acc_LR,2,), "%")
print(confusion_matrix(y_test,LR_pred))
81.11 %
[[43 8]
[ 9 30]]
In [60]:
print(classification_report(y_test,LR_pred))
              precision    recall  f1-score   support

           0       0.83      0.84      0.83        51
           1       0.79      0.77      0.78        39

    accuracy                           0.81        90
   macro avg       0.81      0.81      0.81        90
weighted avg       0.81      0.81      0.81        90
Gradient Boost Classifier
In [61]:
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
# accuracy score, confusion matrix and classification report of gradient boosting classifier
gb_acc = accuracy_score(y_test, gb.predict(X_test))
print(f"Training Accuracy of Gradient Boosting Classifier is {accuracy_score(y_train, gb.predict(X_train))}")
print(f"Test Accuracy of Gradient Boosting Classifier is {gb_acc} \n")
print(f"Confusion Matrix :- \n{confusion_matrix(y_test, gb.predict(X_test))}\n")
Training Accuracy of Gradient Boosting Classifier is 0.898876404494382
Test Accuracy of Gradient Boosting Classifier is 0.8333333333333334
Confusion Matrix :-
[[49 2]
[13 26]]
XGBoost Classifier
XGB classifier hyperparameters
- n_estimators = number of trees created in XGB
- colsample_bytree = fraction of columns sampled for each tree, which helps reduce overfitting and speeds up training
- max_depth = maximum depth of each tree
- learning_rate (eta) = learning rate (shrinks each tree's contribution to the predicted values)
- alpha = L1 regularization parameter
- lambda = L2 regularization parameter
- gamma = a user-defined penalty: the minimum loss reduction required to make a split (it encourages pruning the trees)
- min_child_weight = the minimum sum of instance weights (hessian) needed in a child. For regression with squared error, this corresponds to the minimum number of observations that go to a leaf.
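For reference, here is a sketch of how these hyperparameters are spelled as keyword names in the scikit-learn style XGBoost wrapper. The n_estimators, max_depth and learning_rate values are the ones used in this notebook; the remaining values are untuned placeholders chosen only for illustration.

```python
# Keyword names as used by the scikit-learn style wrapper; placeholder values
# are marked and are not tuned for this dataset.
xgb_params = {
    "n_estimators": 180,       # number of boosted trees (from this notebook)
    "max_depth": 5,            # maximum depth of each tree (from this notebook)
    "learning_rate": 0.1,      # shrinkage per tree (from this notebook)
    "colsample_bytree": 0.8,   # placeholder: fraction of columns sampled per tree
    "reg_alpha": 0.0,          # placeholder: L1 regularization (alpha)
    "reg_lambda": 1.0,         # placeholder: L2 regularization (lambda)
    "gamma": 0.0,              # placeholder: minimum loss reduction to split
    "min_child_weight": 1,     # placeholder: minimum hessian sum in a child
}

# The dictionary could be unpacked into the classifier as XGBClassifier(**xgb_params).
for name, value in xgb_params.items():
    print(f"{name} = {value}")
```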
In [62]:
from xgboost import XGBClassifier
xgb = XGBClassifier(booster = 'gbtree', learning_rate = 0.1, max_depth = 5, n_estimators = 180)
xgb.fit(X_train, y_train)
# accuracy score, confusion matrix and classification report of xgboost
xgb_acc = accuracy_score(y_test, xgb.predict(X_test))
print(f"Training Accuracy of XgBoost is {accuracy_score(y_train, xgb.predict(X_train))}")
print(f"Test Accuracy of XgBoost is {xgb_acc} \n")
Training Accuracy of XgBoost is 0.9338327091136079
Test Accuracy of XgBoost is 0.8444444444444444
Stochastic gradient boosting classifier
In [63]:
sgb = GradientBoostingClassifier(subsample = 0.90, max_features = 0.70)
sgb.fit(X_train, y_train)
# accuracy score, confusion matrix and classification report of stochastic gradient boosting classifier
sgb_acc = accuracy_score(y_test, sgb.predict(X_test))
print(f"Training Accuracy of Stochastic Gradient Boosting is {accuracy_score(y_train, sgb.predict(X_train))}")
print(f"Test Accuracy of Stochastic Gradient Boosting is {sgb_acc} \n")
print(f"Confusion Matrix :- \n{confusion_matrix(y_test, sgb.predict(X_test))}\n")
Training Accuracy of Stochastic Gradient Boosting is 0.8901373283395755
Test Accuracy of Stochastic Gradient Boosting is 0.8222222222222222
Confusion Matrix :-
[[48 3]
[13 26]]
Voting Classifier
In [64]:
from sklearn.ensemble import VotingClassifier
classifiers = [('SVM',linear_svc), ('Random Forest', random_forest1), ('Logistic', LogReg),('Gradient Boost',gb),('XGBoost',xgb),(' sgb classifier',sgb)]
vc = VotingClassifier(estimators = classifiers)
vc.fit(X_train, y_train)
Out[64]:
VotingClassifier(estimators=[('SVM', LinearSVC()),
('Random Forest',
RandomForestClassifier(criterion='entropy',
max_depth=5,
min_samples_leaf=3,
min_samples_split=40,
n_estimators=110,
oob_score=True,
random_state=1)),
('Logistic', LogisticRegressionCV(cv=5)),
('Gradient Boost', GradientBoostingClassifier()),
('XGBoost',
XGBClassifier(base_score=0.5, booster='gbtre...
learning_rate=0.1, max_delta_step=0,
max_depth=5, min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=180, n_jobs=0,
num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1,
tree_method='exact',
validate_parameters=1,
verbosity=None)),
(' sgb classifier',
GradientBoostingClassifier(max_features=0.7,
subsample=0.9))])
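Since no voting argument is passed, the classifier uses hard (majority) voting, the VotingClassifier default. The idea can be sketched with plain NumPy on made-up predictions (the `preds` array is hypothetical and uses three models rather than the six above):

```python
import numpy as np

# Rows = models, columns = samples; hypothetical 0/1 predictions.
preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

# Majority vote per sample: count the votes for each label and pick the most
# common one. argmax returns the lowest label when counts are tied.
votes = np.array([np.bincount(col, minlength=2).argmax() for col in preds.T])

print(votes.tolist())
```

With an even number of models, as in this notebook, ties are possible, so the tie-breaking behaviour can matter.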
In [65]:
vc_acc = accuracy_score(y_test, vc.predict(X_test))
print(f"Training Accuracy of Voting Classifier is {accuracy_score(y_train, vc.predict(X_train))}")
print(f"Test Accuracy of Voting Classifier is {vc_acc} \n")
print(f"{confusion_matrix(y_test, vc.predict(X_test))}\n")
print(classification_report(y_test, vc.predict(X_test)))
Training Accuracy of Voting Classifier is 0.8639200998751561
Test Accuracy of Voting Classifier is 0.8333333333333334
[[49 2]
[13 26]]
              precision    recall  f1-score   support

           0       0.79      0.96      0.87        51
           1       0.93      0.67      0.78        39

    accuracy                           0.83        90
   macro avg       0.86      0.81      0.82        90
weighted avg       0.85      0.83      0.83        90
Final Prediction
In [66]:
test['Fare'].fillna(value=test['Fare'].mean(),inplace=True)
In [67]:
test.isnull().sum()
Out[67]:
PassengerId 0
Sex 0
Age 0
Fare 0
Child 0
Teenager 0
Adult 0
Aged 0
C 0
Q 0
S 0
UpperClass 0
MiddleClass 0
LowerClass 0
dtype: int64
In [68]:
test = test.iloc[:,1:]
test
Out[68]:
In [69]:
vc_final_pred = vc.predict(test)
6. Save Prediction As CSV
In [70]:
test_csv =pd.read_csv('../input/titanic/test.csv')
In [71]:
final = {'PassengerId': test_csv['PassengerId'] ,'Survived':vc_final_pred}
sub = pd.DataFrame (final, columns = ['PassengerId','Survived'])
sub.to_csv('/kaggle/working/vc_submission.csv', index=False)
sub