Titanic Dataset (Top 7%) EDA and Prediction

12 min read · Oct 11, 2021

Dataset Overview 🚢

The titanic.csv file contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe different attributes of the person, including whether they survived (S), their age (A), their passenger class (C), their sex (G), and the fare they paid (X).

The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, you have to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

My Submission

Steps I used in this kernel:

Import Libraries and Read Files
Basic Insights of Data
Feature Engineering (Imputation, Numerical Encoding, Handling Outliers, Binning)
Data Visualization
Modeling (Train-Test Split, Model Implementation)
Saving CSV

1. Importing libraries and read files

In [1]:

import os
import numpy   as np 
import pandas  as pd 
import seaborn as sns
from matplotlib import pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings("ignore")

In [2]:

data = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')
ss = pd.read_csv('../input/titanic/gender_submission.csv')

Now let’s print out some random rows from the dataset using the sample() function.

In [3]:

data.sample(5)

Let’s have a look over the first 5 and last 5 rows of the dataset

Top 5 values of the dataset

In [5]:

data.head()

Last 5 values of the dataset

In [6]:

data.tail()

2. Basic insights of data

In this part we will go through the shape of the dataset, what columns it has, and the theoretical and statistical summary of the dataset.

In [7]:

data.shape

Out[7]:

(891, 12)

In [8]:

data.columns

Out[8]:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Theoretical information about data =>

In [9]:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Statistical information about data =>

In [10]:

data.describe()

Out[10]:

3. Feature Engineering

3.1 Imputation

Removing null/missing values

We are removing these null values because they adversely affect the performance and accuracy of any machine learning algorithm. So, removing null values from the dataset before modeling is one of the important steps in data wrangling.

In [11]:

data.isnull().sum()

Out[11]:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We have a Cabin column with 687 out of 891 values as null so we will drop it.

In [12]:

data.drop('Cabin',axis=1,inplace=True)

Now we are left with Embarked, which has 2 null values, and Age, which has 177. We will fill Embarked with its mode and Age with its mean.

In [13]:

# mode() returns a Series, so take its first entry as the fill value
data['Embarked'].fillna(value = data['Embarked'].mode()[0],inplace=True)
data['Age'].fillna(value = data['Age'].mean(),inplace = True)
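As a minimal sketch of the imputation step, here is the same idea on a made-up toy frame (the values are invented for illustration and are not from the Titanic data). It shows that Series.mode() returns a Series, because several values can tie, so one entry has to be picked before it is used as a fill value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic columns (values are made up).
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Age": [22.0, np.nan, 30.0, 40.0, np.nan],
})

# Series.mode() returns a Series (ties are possible), so take its first entry.
embarked_mode = toy["Embarked"].mode()[0]
age_mean = toy["Age"].mean()

# Fill the gaps: mode for the categorical column, mean for the numeric one.
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)
toy["Age"] = toy["Age"].fillna(age_mean)

print(embarked_mode, round(age_mean, 2))
print(toy.isnull().sum().sum())  # total number of remaining nulls
```

Assigning the result of fillna() back to the column, as done here, is an alternative to inplace=True that behaves the same across pandas versions.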

In [14]:

data.isnull().sum()

Out[14]:

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

So all the null values are now handled.

In [15]:

data.drop('Embarked',axis=1,inplace=True)
data.drop('Name',axis=1,inplace=True)

In [16]:

data.columns

Out[16]:

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare'],
      dtype='object')

3.2 Numerical Encoding

In [17]:

temp = pd.read_csv('../input/titanic/train.csv') # all extra modifications will be done on it.

In [19]:

df_test.head()

Out[19]:

Creating a new dataframe new_data from temp with the Ticket column dropped, then preparing the test set the same way.

In [20]:

new_data = temp.drop('Ticket',axis=1)
df_test.drop('Ticket',axis=1,inplace=True)
new_data.head()

Out[20]:

Now we will fill all null values in the “Age” column with the mode of the whole column (the most common age).

We will be using pandas function fillna() for this purpose.

In [21]:

mode_value = new_data['Age'].mode()
mode_t_value = df_test['Age'].mode()
mode_value,mode_t_value

Out[21]:

(0    24.0
 dtype: float64,
 0    21.0
 1    24.0
 dtype: float64)

In [22]:

new_data['Age'].fillna(value = 24.0,inplace = True)
df_test['Age'].fillna(value = 24.0,inplace = True)

Dropping the unwanted columns out of the final dataset

In [23]:

try:
    new_data.drop('Name',axis=1,inplace=True)
    df_test.drop('Name',axis=1,inplace=True)
except KeyError:  # raised when the column is no longer present
    print("Name Already Dropped !")
try:
    new_data.drop('Cabin',axis=1,inplace=True)
    df_test.drop('Cabin',axis=1,inplace=True)
except KeyError:
    print("Cabin Already Dropped !")

The “Sex” column is categorical, but it plays an important role in the final prediction. So we will convert it to a numerical column using the pandas function get_dummies().

pandas.get_dummies() is used for data manipulation. It converts categorical data into dummy or indicator variables.
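A small sketch of what get_dummies() returns for a two-category column, using a made-up Series rather than the real data. The point is that the result has one indicator column per category, so a single 0/1 column has to be selected if only one encoded column is wanted.

```python
import pandas as pd

# Made-up sample of the Sex column.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# One indicator column per category, with categories in sorted order.
dummies = pd.get_dummies(sex)
print(list(dummies.columns))

# A single 0/1 column is enough for a two-category feature.
is_female = dummies["female"].astype(int)
print(is_female.tolist())
```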

In [24]:

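# get_dummies returns one column per category; keep a single 0/1 indicator (1 = female)
new_data['Sex'] = pd.get_dummies(new_data['Sex'])['female'].astype(int)
df_test['Sex'] = pd.get_dummies(df_test['Sex'])['female'].astype(int)
new_data.head()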

Out[24]:

3.3 Handling Outliers

In [25]:

new_data.boxplot()

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3fea14f150>

From the above boxplot, we can see that some values of the Fare column are far away from the normal range of other data values. This means we have to check for the outliers in this column.

For this I will use 2 plotting methods, the first one is boxplot of that particular column and the second one is kdeplot() (used for distribution plot).

In [26]:

fig,axes = plt.subplots(1,2,figsize=(14,5))
sns.boxplot(new_data['Fare'],ax=axes[0]).set_title('Fare Box Plot Before',fontsize=18)
sns.kdeplot(new_data['Fare'],ax=axes[1]).set_title('Fare Distribution Plot Before',fontsize=18)
plt.show()

Here it’s clear that our data contains outliers and is not normalized. So let's remove these outliers by Interquartile Range.

What is Interquartile Range IQR?

IQR is used to measure variability by dividing a data set into quartiles. The data is sorted in ascending order and split into 4 equal parts.

Q1, Q2, Q3 called first, second and third quartiles are the values that separate the 4 equal parts:

Q1 represents the 25th percentile of the data.

Q2 represents the 50th percentile of the data.

Q3 represents the 75th percentile of the data.

And if a dataset has 2n or 2n+1 data points, then

Q1 = median of the n smallest data points.

Q2 = median of the whole dataset.

Q3 = median of the n largest data points.

IQR is the range between the first and the third quartiles, namely Q1 and Q3: IQR = Q3 - Q1. The data points which fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are outliers.
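To make the rule concrete, here is a short worked sketch on a made-up list of fares (not the Titanic values). It uses NumPy's default percentile method, which interpolates linearly and can differ slightly from the median-of-halves rule described above.

```python
import numpy as np

# Made-up fares with one obviously extreme value.
fares = np.array([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])

# First and third quartiles (25th and 75th percentiles).
q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(q1, q3, iqr)
print(lower, upper)
print(outliers)
```

What to do with the flagged points (drop them, cap them, or leave them) is a separate modeling choice.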

In [27]:

# Note: these are the 10th and 80th percentiles, a wider band than the classic 25th/75th quartiles
Q1=data['Fare'].quantile(0.10)
Q3=data['Fare'].quantile(0.80)
IQR= Q3-Q1
# values outside the fences are capped at Q3 / Q1 rather than dropped
new_data.loc[(new_data['Fare'] > (Q3 + 1.5*IQR)),'Fare'] = Q3
new_data.loc[(new_data['Fare'] < (Q1 - 1.5*IQR)),'Fare'] = Q1

# same treatment for the test set, using its own percentiles
t_Q1=df_test['Fare'].quantile(0.10)
t_Q3=df_test['Fare'].quantile(0.80)
t_IQR= t_Q3-t_Q1

df_test.loc[(df_test['Fare'] > (t_Q3 + 1.5*t_IQR)),'Fare'] = t_Q3
df_test.loc[(df_test['Fare'] < (t_Q1 - 1.5*t_IQR)),'Fare'] = t_Q1

print(Q1,Q3)
print(new_data.shape)

7.55 39.6875
(891, 9)

In [28]:

fig,axes = plt.subplots(1,2,figsize=(14,5))
sns.boxplot(new_data['Fare'],ax=axes[0]).set_title('Fare Box Plot After',fontsize=18)
sns.kdeplot(new_data['Fare'],ax=axes[1]).set_title('Fare Distribution Plot After',fontsize=18)
plt.show()

Now all the outliers in the Fare column have been capped (the row count is unchanged, so no passengers were dropped).

3.4 Binning

The main motivation for binning is to make the model more robust and prevent overfitting. However, it comes at a cost to performance, because information within each bin is lost.

In [30]:

new_data['Age'].min(),new_data['Age'].max()

Out[30]:

(0.42, 80.0)

So we have an age range of 0.42 to 80 years.

In [31]:

for i in new_data['Age'].values: 
    if i<1:
        print(i)

0.83
0.92
0.75
0.75
0.67
0.42
0.83

In [32]:

# AGE_RANGE
new_data['Age_Range'] = pd.cut(new_data['Age'], bins=[0,9,18,60,100], labels=["Child","Teenager","Adult", "Aged"])
df_test['Age_Range'] = pd.cut(df_test['Age'], bins=[0,9,18,60,100], labels=["Child","Teenager","Adult", "Aged"])
new_data.head()

Out[32]:

In [33]:

new_data = new_data.join(pd.get_dummies(new_data['Age_Range']))
df_test = df_test.join(pd.get_dummies(df_test['Age_Range']))
age_range = new_data['Age_Range'].values
new_data.drop('Age_Range',axis=1,inplace=True)
df_test.drop('Age_Range',axis=1,inplace=True)
new_data.head()
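The binning and one-hot steps above can be sketched on a handful of made-up ages. The sketch assumes the same bin edges and labels as the cells above, and it shows that pd.cut() treats each bin as right-inclusive by default, so an age of exactly 9 lands in "Child" and exactly 18 in "Teenager".

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([0.42, 9.0, 9.5, 18.0, 35.0, 60.0, 80.0], name="Age")

# Bins are right-inclusive by default: (0, 9], (9, 18], (18, 60], (60, 100].
age_range = pd.cut(ages, bins=[0, 9, 18, 60, 100],
                   labels=["Child", "Teenager", "Adult", "Aged"])
print(age_range.tolist())

# One indicator column per bin, in label order.
indicators = pd.get_dummies(age_range)
print(list(indicators.columns))
```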

After creating bins for Age column we can now move to Embarked column

In [34]:

embarked = pd.get_dummies(new_data['Embarked'])
new_data = new_data.join(embarked)
new_data.drop('Embarked',axis=1,inplace=True)

T_embarked = pd.get_dummies(df_test['Embarked'])
df_test = df_test.join(T_embarked)
df_test.drop('Embarked',axis=1,inplace=True)

new_data.head()

In [35]:

new_data.drop('Parch',axis=1,inplace=True)
new_data.drop('SibSp',axis=1,inplace=True)

df_test.drop('Parch',axis=1,inplace=True)
df_test.drop('SibSp',axis=1,inplace=True)

Performing the same task over the Pclass column

In [36]:

Pclass = pd.get_dummies(new_data['Pclass'])
Pclass.columns=['UpperClass', 'MiddleClass','LowerClass']
new_data = new_data.join(Pclass)
new_data.drop('Pclass',axis=1,inplace=True)

Pclass = pd.get_dummies(df_test['Pclass'])
Pclass.columns=['UpperClass', 'MiddleClass','LowerClass']
df_test = df_test.join(Pclass)
df_test.drop('Pclass',axis=1,inplace=True)

new_data.head()

Out[36]:

In [37]:

df_test.head()

Out[37]:

4. DATA VISUALIZATION

In [38]:

temp.head() # untouched copy of the training set, will be used for visualization

In [39]:

new_data.corr()

Out[39]:

In [40]:

fig,axes = plt.subplots(1,2,figsize=(20,7))
plt.suptitle('Original v/s Featured Data', fontsize=18)
# restrict temp to numeric columns so corr() does not trip over text columns such as Name
sns.heatmap(temp.select_dtypes('number').corr(),ax=axes[0]).set_title('Original Data')
sns.heatmap(new_data.corr(),ax=axes[1]).set_title('Featured Data')
plt.show()

From the heatmaps above, it’s clear that we now have a lot more features to process and consider for our prediction model.

In [41]:

sns.pairplot(data)
plt.show()

In [42]:

fig,axes = plt.subplots(1,3,figsize=(15,5))
sns.distplot(new_data['Fare'],ax=axes[0])
sns.distplot(data['Age'],ax=axes[1])
sns.distplot(data['Pclass'],ax=axes[2])
plt.show()

In [43]:

fig,axes = plt.subplots(2,2,figsize=(16,10))
# x/y are passed as keywords, which newer seaborn releases require
sns.scatterplot(x=new_data['Age'],y=new_data['Fare'],ax=axes[0,0])
sns.scatterplot(x=new_data['Age'],y=new_data['Survived'],ax=axes[0,1])
sns.barplot(x=new_data['Sex'],y=new_data['Age'],ax=axes[1,0])
sns.barplot(x=new_data['Sex'],y=new_data['Survived'],ax=axes[1,1])
plt.show()

In [44]:

new_data.head()

In [45]:

fig, ax = plt.subplots(1,2,figsize=(16,5))
sns.countplot(x=age_range,ax=ax[0]).set_title('Count of Age Group',fontsize=16)
sns.barplot(x=age_range,y=new_data['Survived'],ax=ax[1]).set_title('Survived v/s Age Group',fontsize=16)

Out[45]:

Text(0.5, 1.0, 'Survived v/s Age Group')

In [46]:

fig, ax = plt.subplots(1,2,figsize=(16,5))
sns.countplot(x=temp['Embarked'],ax=ax[0]).set_title('Count of Embarked',fontsize=16)
sns.barplot(x=temp['Embarked'],y=new_data['Survived'],ax=ax[1]).set_title('Survived v/s Embarked',fontsize=16)

Out[46]:

Text(0.5, 1.0, 'Survived v/s Embarked')

In [47]:

fig, ax = plt.subplots(1,2,figsize=(16,5))
sns.countplot(x=temp['Pclass'],ax=ax[0]).set_title('Count of Passenger Class',fontsize=16)
sns.barplot(x=temp['Pclass'],y=new_data['Survived'],ax=ax[1]).set_title('Survived v/s Passenger Class',fontsize=16)
plt.show()

5. MODELING

5.1 TRAIN-TEST SPLIT

In [48]:

new_data.isnull().sum()

Out[48]:

PassengerId    0
Survived       0
Sex            0
Age            0
Fare           0
Child          0
Teenager       0
Adult          0
Aged           0
C              0
Q              0
S              0
UpperClass     0
MiddleClass    0
LowerClass     0
dtype: int64

In [49]:

# we have new_data for training purpose and df_test for prediction So lets create testing data.
train = new_data
test = df_test
train.head()

Out[49]:

In [50]:

test.head()

Out[50]:

In [51]:

X = train.drop('Survived',axis=1)
y = train['Survived']
X = X.iloc[:,1:]
X.head(3)

Out[51]:

In [52]:

y.head(5)

Out[52]:

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [53]:

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.10,random_state=0)
X_train.shape,y_train.shape,X_test.shape,y_test.shape

Out[53]:

((801, 13), (801,), (90, 13), (90,))

5.2 Model Implementation

In [54]:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Support-Vector Machine

In [55]:

linear_svc = LinearSVC()
linear_svc.fit(X_train, y_train)
svc_pred = linear_svc.predict(X_test)
linear_svc.score(X_train, y_train)
acc_linear_svc = round(linear_svc.score(X_test, y_test) * 100, 2)
print(round(acc_linear_svc,2,), "%")
print(confusion_matrix(y_test,svc_pred))

80.0 %
[[47  4]
 [14 25]]

In [56]:

print(classification_report(y_test,svc_pred))

              precision    recall  f1-score   support

           0       0.77      0.92      0.84        51
           1       0.86      0.64      0.74        39

    accuracy                           0.80        90
   macro avg       0.82      0.78      0.79        90
weighted avg       0.81      0.80      0.79        90
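As a quick check on how the report relates to the confusion matrix, the class-1 ("survived") numbers can be recomputed by hand from the LinearSVC matrix printed above. This is only arithmetic on the reported counts, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# LinearSVC confusion matrix reported above: rows are true labels, columns are predictions.
tn, fp = 47, 4    # true class 0: predicted 0, predicted 1
fn, tp = 14, 25   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of the predicted survivors, how many survived
recall_1 = tp / (tp + fn)      # of the actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The low recall for class 1 reflects the 14 survivors that the model predicted as non-survivors.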

Random-Forest

Hyperparameters I used:

n_estimators (int, default=100): The number of trees in the forest.
criterion {“gini”, “entropy”}: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split: The minimum number of samples required to split an internal node. It can be an int or, since scikit-learn 0.18, a float giving a fraction of the samples.
min_samples_leaf: The minimum number of samples required to be at a leaf node. This may have the effect of smoothing the model, especially in regression.
max_features {“auto”, “sqrt”, “log2”}, int or float, default=”auto”: The maximum number of features to consider when looking for the best split:

If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
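For a sense of scale, here is the arithmetic for the 13 feature columns in X_train. The rounding rule used below (truncate to an integer, with a floor of 1) is an assumption about how the square root and logarithm are turned into a feature count, so treat the numbers as a sketch.

```python
import math

n_features = 13  # number of columns in X_train, per the shapes printed earlier

# Assumed rounding: truncate to an int and never go below 1.
sqrt_choice = max(1, int(math.sqrt(n_features)))   # sqrt(13) is about 3.61
log2_choice = max(1, int(math.log2(n_features)))   # log2(13) is about 3.70
none_choice = n_features                           # consider every feature

print(sqrt_choice, log2_choice, none_choice)
```

With only 13 columns, "sqrt" and "log2" end up considering the same number of features per split.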

In [57]:

# max_features='sqrt' is what 'auto' meant for classifiers; newer scikit-learn releases no longer accept 'auto'
random_forest1 = RandomForestClassifier(criterion='entropy', n_estimators=110,min_samples_split=40,min_samples_leaf=3, max_depth=5, max_features='sqrt',oob_score=True,random_state=1)
random_forest1.fit(X_train, y_train)
rf_pred1 = random_forest1.predict(X_test)
random_forest1.score(X_train, y_train)
acc_rf1 = round(random_forest1.score(X_test, y_test) * 100, 2)
print(round(acc_rf1,2,), "%")
print(confusion_matrix(y_test,rf_pred1))

81.11 %
[[49  2]
 [15 24]]

In [58]:

print(random_forest1.oob_score_)
print(classification_report(y_test,rf_pred1))

0.8227215980024969
              precision    recall  f1-score   support

           0       0.77      0.96      0.85        51
           1       0.92      0.62      0.74        39

    accuracy                           0.81        90
   macro avg       0.84      0.79      0.80        90
weighted avg       0.83      0.81      0.80        90

Logistic Regression

In [59]:

LogReg = LogisticRegressionCV(cv=5)
LogReg.fit(X_train,y_train)
LR_pred = LogReg.predict(X_test)
LogReg.score(X_train, y_train)
acc_LR = round(LogReg.score(X_test, y_test) * 100, 2)
print(round(acc_LR,2,), "%")
print(confusion_matrix(y_test,LR_pred))

81.11 %
[[43  8]
 [ 9 30]]

In [60]:

print(classification_report(y_test,LR_pred))

              precision    recall  f1-score   support

           0       0.83      0.84      0.83        51
           1       0.79      0.77      0.78        39

    accuracy                           0.81        90
   macro avg       0.81      0.81      0.81        90
weighted avg       0.81      0.81      0.81        90

Gradient Boost Classifier

In [61]:

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

# accuracy score, confusion matrix and classification report of gradient boosting classifier
gb_acc = accuracy_score(y_test, gb.predict(X_test))

print(f"Training Accuracy of Gradient Boosting Classifier is {accuracy_score(y_train, gb.predict(X_train))}")
print(f"Test Accuracy of Gradient Boosting Classifier is {gb_acc} \n")

print(f"Confusion Matrix :- \n{confusion_matrix(y_test, gb.predict(X_test))}\n")

Training Accuracy of Gradient Boosting Classifier is 0.898876404494382
Test Accuracy of Gradient Boosting Classifier is 0.8333333333333334 

Confusion Matrix :- 
[[49  2]
 [13 26]]

XGBoost Classifier

XGB classifier hyperparameters

n_estimators = number of trees (boosting rounds) created in XGB
colsample_bytree = fraction of columns sampled when building each tree; it helps reduce overfitting and speeds up the process
max_depth = maximum depth of each tree
learning_rate (eta) = step-size shrinkage applied to each new tree's contribution when updating the predicted values
alpha = L1 regularization parameter on the weights
lambda = L2 regularization parameter on the weights
gamma = minimum loss reduction required to make a further split; it acts as a user-defined penalty that encourages pruning the trees
min_child_weight = minimum sum of instance weights (hessian) needed in a child. For squared-error regression, this is simply the minimum number of observations that go to a leaf

In [62]:

from xgboost import XGBClassifier

xgb = XGBClassifier(booster = 'gbtree', learning_rate = 0.1, max_depth = 5, n_estimators = 180)
xgb.fit(X_train, y_train)

# accuracy score, confusion matrix and classification report of xgboost
xgb_acc = accuracy_score(y_test, xgb.predict(X_test))

print(f"Training Accuracy of XgBoost is {accuracy_score(y_train, xgb.predict(X_train))}")
print(f"Test Accuracy of XgBoost is {xgb_acc} \n")

Training Accuracy of XgBoost is 0.9338327091136079
Test Accuracy of XgBoost is 0.8444444444444444

Stochastic gradient boosting classifier

In [63]:

sgb = GradientBoostingClassifier(subsample = 0.90, max_features = 0.70)
sgb.fit(X_train, y_train)

# accuracy score, confusion matrix and classification report of stochastic gradient boosting classifier
sgb_acc = accuracy_score(y_test, sgb.predict(X_test))

print(f"Training Accuracy of Stochastic Gradient Boosting is {accuracy_score(y_train, sgb.predict(X_train))}")
print(f"Test Accuracy of Stochastic Gradient Boosting is {sgb_acc} \n")

print(f"Confusion Matrix :- \n{confusion_matrix(y_test, sgb.predict(X_test))}\n")

Training Accuracy of Stochastic Gradient Boosting is 0.8901373283395755
Test Accuracy of Stochastic Gradient Boosting is 0.8222222222222222 

Confusion Matrix :- 
[[48  3]
 [13 26]]

Voting Classifier

In [64]:

from sklearn.ensemble import VotingClassifier

classifiers = [('SVM',linear_svc), ('Random Forest', random_forest1), ('Logistic', LogReg),('Gradient Boost',gb),('XGBoost',xgb),(' sgb classifier',sgb)]
vc = VotingClassifier(estimators = classifiers)
vc.fit(X_train, y_train)

Out[64]:

VotingClassifier(estimators=[('SVM', LinearSVC()),
                             ('Random Forest',
                              RandomForestClassifier(criterion='entropy',
                                                     max_depth=5,
                                                     min_samples_leaf=3,
                                                     min_samples_split=40,
                                                     n_estimators=110,
                                                     oob_score=True,
                                                     random_state=1)),
                             ('Logistic', LogisticRegressionCV(cv=5)),
                             ('Gradient Boost', GradientBoostingClassifier()),
                             ('XGBoost',
                              XGBClassifier(base_score=0.5, booster='gbtre...
                                            learning_rate=0.1, max_delta_step=0,
                                            max_depth=5, min_child_weight=1,
                                            missing=nan,
                                            monotone_constraints='()',
                                            n_estimators=180, n_jobs=0,
                                            num_parallel_tree=1, random_state=0,
                                            reg_alpha=0, reg_lambda=1,
                                            scale_pos_weight=1, subsample=1,
                                            tree_method='exact',
                                            validate_parameters=1,
                                            verbosity=None)),
                             (' sgb classifier',
                              GradientBoostingClassifier(max_features=0.7,
                                                         subsample=0.9))])
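Since no voting argument is passed, VotingClassifier falls back to its default, hard (majority) voting. The idea can be sketched with made-up predictions from six classifiers; the prediction values below are hypothetical and are not the outputs of the models above.

```python
import numpy as np

# Hypothetical 0/1 predictions: one row per classifier, one column per passenger.
predictions = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 1],
])

# Count the votes per passenger and keep the most common label.
# argmax returns the first maximum, so a 3-3 tie resolves to the lower label (0).
majority = np.array([
    np.bincount(votes, minlength=2).argmax() for votes in predictions.T
])

print(majority)
```

With an even number of classifiers, as here, ties are possible (the third passenger in the sketch gets three votes each way), which is one reason an odd number of voters is often preferred.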

In [65]:

vc_acc = accuracy_score(y_test, vc.predict(X_test))

print(f"Training Accuracy of Voting Classifier is {accuracy_score(y_train, vc.predict(X_train))}")
print(f"Test Accuracy of Voting Classifier is {vc_acc} \n")

print(f"{confusion_matrix(y_test, vc.predict(X_test))}\n")
print(classification_report(y_test, vc.predict(X_test)))

Training Accuracy of Voting Classifier is 0.8639200998751561
Test Accuracy of Voting Classifier is 0.8333333333333334 

[[49  2]
 [13 26]]

              precision    recall  f1-score   support

           0       0.79      0.96      0.87        51
           1       0.93      0.67      0.78        39

    accuracy                           0.83        90
   macro avg       0.86      0.81      0.82        90
weighted avg       0.85      0.83      0.83        90

Final Prediction

In [66]:

test['Fare'].fillna(value=test['Fare'].mean(),inplace=True)

In [67]:

test.isnull().sum()

Out[67]:

PassengerId    0
Sex            0
Age            0
Fare           0
Child          0
Teenager       0
Adult          0
Aged           0
C              0
Q              0
S              0
UpperClass     0
MiddleClass    0
LowerClass     0
dtype: int64

In [68]:

test = test.iloc[:,1:]
test

Out[68]:

In [69]:

vc_final_pred = vc.predict(test)

6. Save Prediction As CSV

In [70]:

test_csv =pd.read_csv('../input/titanic/test.csv')

In [71]:

final = {'PassengerId': test_csv['PassengerId'] ,'Survived':vc_final_pred}

sub = pd.DataFrame (final, columns = ['PassengerId','Survived'])
sub.to_csv('/kaggle/working/vc_submission.csv', index=False)
sub

If you enjoyed reading this article, a 👏 will motivate me to do more of this type of work.

Also if there is any feedback or suggestion please let me know in the comment section.