Day 29 of 100DaysofML

Charan Soneji
Published in 100DaysofMLcode
3 min read · Jul 15, 2020

Kaggle Titanic Dataset. This is the last part of this series, in which I cover the models I decided to use, compare their accuracies, and then choose a specific model based on the results.

Let’s start by splitting the data into training and validation sets using sklearn. The chosen features are stored in the predictors variable; axis=1 tells drop to remove columns rather than rows, so the Survived and PassengerId columns are excluded from the features.

from sklearn.model_selection import train_test_split
predictors = train_data.drop(['Survived', 'PassengerId'], axis=1)
target = train_data["Survived"]
x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size = 0.22, random_state = 0)

I decided to start with the Logistic Regression model, which is the absolute basics for starters; the code is given below, followed by the accuracy:

# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
logreg = LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_val)
acc_logreg = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_logreg)

Next, I decided to go with the SVM classifier, the code for which is given below:

# Support Vector Machines
from sklearn.svm import SVC

svc = SVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_val)
acc_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_svc)

The next classifier is the Linear SVC classifier:

# Linear SVC
from sklearn.svm import LinearSVC
linear_svc = LinearSVC(max_iter=10000)  # raise max_iter; LinearSVC often needs more iterations to converge
linear_svc.fit(x_train, y_train)
y_pred = linear_svc.predict(x_val)
acc_linear_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print("The accuracy of Linear SVC is {}.".format(acc_linear_svc))

The next classifier that I have used is the Decision Tree Classifier:

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train, y_train)
y_pred = decisiontree.predict(x_val)
acc_decisiontree = round(accuracy_score(y_pred, y_val) * 100, 2)
print("The accuracy of Decision tree is {}.".format(acc_decisiontree))

The next model that I decided to use is the KNN model.

# KNN or k-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_val)
acc_knn = round(accuracy_score(y_pred, y_val) * 100, 2)
print("The accuracy of KNN model is {}.".format(acc_knn))
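KNeighborsClassifier defaults to n_neighbors=5, and the choice of k can noticeably change the validation accuracy. Below is a minimal, hypothetical sketch of sweeping a few odd values of k — it runs on a synthetic dataset, not the Titanic features, so the numbers are illustrative only:

```python
# Hypothetical k-sweep for KNN on synthetic data (not the Titanic features).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=6, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.22, random_state=0)

scores = {}
for k in range(1, 16, 2):  # odd k values avoid voting ties in binary classification
    knn = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    scores[k] = accuracy_score(y_val, knn.predict(x_val))

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", round(scores[best_k] * 100, 2))
```

On the real Titanic data, the same loop would use the x_train/x_val split created earlier.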

We now arrange all the models in decreasing order of their accuracies using the following piece of code:

import pandas as pd

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Linear SVC', 'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_logreg,
              acc_linear_svc, acc_decisiontree]})
models.sort_values(by='Score', ascending=False)
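A side note that goes beyond the original workflow: a single train/validation split can be noisy, so the ranking above may shift from run to run. As a hedged sketch (again on synthetic data, not the Titanic features), cross_val_score averages accuracy over several folds for a more stable comparison:

```python
# Hypothetical 5-fold cross-validation comparison on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Decision Tree", DecisionTreeClassifier(random_state=0))]:
    # cross_val_score trains and scores the model on each of 5 folds
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean() * 100, 2))
```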

From the resulting table, it can be seen that the Decision Tree gives us the highest accuracy, and it can be improved further using a boosting algorithm such as XGBoost.
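The original post mentions boosting but does not show code for it. As an illustrative sketch, I use sklearn's GradientBoostingClassifier as a stand-in for XGBoost, again on a synthetic dataset rather than the Titanic features:

```python
# Hypothetical boosting sketch: compare a single decision tree against
# gradient boosting (a stand-in for XGBoost) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=800, n_features=8, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.22, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
boost = GradientBoostingClassifier(random_state=0).fit(x_train, y_train)

acc_tree = accuracy_score(y_val, tree.predict(x_val))
acc_boost = accuracy_score(y_val, boost.predict(x_val))
print("tree:", round(acc_tree * 100, 2), "boosted:", round(acc_boost * 100, 2))
```

The xgboost library itself offers XGBClassifier with a similar fit/predict interface, but it is a separate install.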

This is basically an overview of how we can approach a number of problems. That’s it for today. Thanks for reading. Keep Learning.

Cheers.
