Your Guide for Logistic Regression with Titanic Dataset
The next machine learning algorithm we’ll be talking about is logistic regression (which is built around the sigmoid function). As I write, I will keep learning, just like you. So, let’s take a trip through logistic regression, and mark one more algorithm as DONE!
In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick.
This is how Wikipedia explains it. Looks pretty simple, right? The model’s purpose is to give us a 0 or a 1. It predicts a value between 0 and 1, and that value shows the probability of the event in question.
Logistic regression doesn’t predict continuous values; it predicts whether something is True or False.
Let’s go through an example. Actually, it is a pretty famous one: the Titanic dataset. You have more than one feature, and with logistic regression you predict whether each passenger is dead or alive. If the model predicts a value of 0.79, that means the person is 79% likely to be alive and 21% likely to be dead.
When the probability is greater than or equal to 0.5, the binary value is 1; when the probability is less than 0.5, the binary value is 0. So, the person I just mentioned above will be classified as 1, alive. The model returns 1 (True).
Logistic regression’s graph looks like an ‘S’ between 0 and 1, as you can see here:
The curve indicates the probability of a case.
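To make that concrete, here is a minimal sketch of the sigmoid function and the 0.5 cut-off described above (the numbers and variable names are my own, just for illustration):
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

probability = sigmoid(1.32)          # 1.32 stands in for some weighted sum of features
predicted_class = int(probability >= 0.5)
print(probability, predicted_class)  # ~0.79 -> classified as 1 (alive)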
It looks similar to linear regression, but as a matter of fact, it is not. Linear regression predicts continuous values; logistic regression, on the other hand, is a classification algorithm. Linear regression used least squares to fit the best line to the data, but logistic regression cannot use that method, so it needs another one: logistic regression uses ‘maximum likelihood’ to fit the best S-shaped curve to the data.
What is maximum likelihood?!
Maximum Likelihood Estimation involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample (X).
I am going to keep this very simple and skip the fancy words; I will write it the way I understand it. The model calculates the likelihood of one person in the Titanic dataset being alive, then another one, and then another one. Once all of those calculations are done, it multiplies the likelihoods together, which measures how well a given S-shaped curve fits the data. It keeps recalculating until it finds the S-shaped curve with the highest likelihood.
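If you want to see that idea as code, here is a rough toy sketch (my own made-up numbers, not the exact procedure scikit-learn runs): for one candidate S-shaped curve we take each passenger’s likelihood and multiply them all together; the curve with the largest product is the best fit. In practice the sum of log-likelihoods is maximized instead, because multiplying many small numbers quickly underflows.
import numpy as np

# toy data: 1 = alive, 0 = dead
y_true = np.array([1, 0, 1, 1, 0])

# probability of being alive that one candidate S-shaped curve assigns to each passenger
p_alive = np.array([0.8, 0.3, 0.6, 0.9, 0.2])

# likelihood of what actually happened to each passenger under this curve
likelihoods = np.where(y_true == 1, p_alive, 1 - p_alive)

print(likelihoods.prod())         # 0.24192 -- the quantity maximum likelihood maximizes
print(np.log(likelihoods).sum())  # the log version used in practice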
You can read this article if you want to learn more.
Or, you can watch this video.
Now, we are going to implement logistic regression on the Titanic dataset. We are not sure whether this algorithm is the best match for this dataset, but we will find out together.
The dataset is available via the Seaborn library. If you don’t have Seaborn installed already, you can install it from the command line like this:
pip3 install seaborn # check Seaborn documentations for details
Now, we can import our libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from seaborn import load_dataset # this function will help us download the Titanic dataset
%matplotlib inline # if you use jupyter notebook
plt.style.use('ggplot') # check for more with plt.style.available
Downloading dataset:
data = load_dataset("titanic")
data
You should see something like this after running the code above. If everything looks alright, keep in mind that this is not an article where we only apply logistic regression: we will first clean the data, then visualize it, and then implement logistic regression.
data.info()
As you can see, we have null values in the ‘age’, ‘embarked’, ‘deck’, and ‘embark_town’ columns. We will drop some of them and deal with the rest.
Some of the columns carry the same information under different names or value types, like ‘who’, ‘sex’, and ‘adult_male’. I don’t want those duplicates in my model.
columns = ['alive', 'alone', 'embark_town', 'who', 'adult_male', 'deck']
data_2 = data.drop(columns, axis=1)
The ‘columns’ list contains the names I want to drop from my dataset, and the ‘drop’ method drops them. axis=1 means we want to drop columns, not rows.
I assigned my new dataset to another variable. If you want, you can make the change permanent on your original data by passing inplace=True.
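Just for illustration (don’t run this here if you want to keep following along with data_2), the in-place version would look like this:
data.drop(columns, axis=1, inplace=True)  # modifies data directly and returns None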
data_2.describe(include='all').T
print(f"Max value of age column : {data_2['age'].max()}")
print(f"Min value of age column : {data_2['age'].min()}")
>> Max value of age column : 80.0
>> Min value of age column : 0.42
Since the age column ranges from 0 to 80, we can bin it into categories.
bins = [0, 5, 17, 25, 50, 80]
labels = ['Infant', 'Kid', 'Young', 'Adult', 'Old']
data_2['age'] = pd.cut(data_2['age'], bins = bins, labels=labels)
Pandas ‘cut’ method will let us make our own categorization.
pd.DataFrame(data_2['age'].value_counts())
Voilà! We can see that adults are in the majority.
We still have null values in ‘age’ column.
data_2['age'].mode()[0]
>> 'Adult'
We can fill null values with mode of this column. This is an option, and that’s what I’ll do!
data_3 = data_2.fillna({'age' : data_2['age'].mode()[0]})
We are done with ‘age’ column. Yes! It’s time for ‘embarked’ column.
data_2['embarked'].unique()
>> array(['S', 'C', 'Q', nan], dtype=object)
Our ‘embarked’ column contains ‘S’, ‘C’, ‘Q’ and, obviously, some NaN values.
print(f"How many 'S' on embarked column : {data_2[data_2['embarked'] == 'S'].shape[0]}")
print(f"How many 'C' on embarked column : {data_2[data_2['embarked'] == 'C'].shape[0]}")
print(f"How many 'Q' on embarked column : {data_2[data_2['embarked'] == 'Q'].shape[0]}")
>> How many 'S' on embarked column : 644
>> How many 'C' on embarked column : 168
>> How many 'Q' on embarked column : 77
Looks like we can use the mode of this column, ‘S’, to fill the NaN values.
data_4 = data_3.fillna({'embarked' : 'S'})
Now we can check how survival relates to passenger class and to sex:
data_4[['pclass', 'survived']].groupby(['pclass']).sum().sort_values(by='survived')
data_4[['sex', 'survived']].groupby(['sex']).sum().sort_values(by='survived')
bins = [-1, 7.9104, 14.4542, 31, 512.330]
labels = ['low', 'medium-low', 'medium', 'high']
data_4['fare'] = pd.cut(data_4["fare"], bins = bins, labels = labels)
We can categorize the fare column too, as in the code above.
Final version of our dataset.
We should drop ‘class’ too, because it is the same as ‘pclass’, and ‘pclass’ is already numeric.
data_5 = data_4.drop('class', axis=1)
sns.distplot(data_5['survived'])  # distplot is removed in newer Seaborn releases; sns.histplot works there
plt.figure(figsize=(20, 10))
plt.subplot(321)
sns.barplot(x = 'sibsp', y = 'survived', data = data_5)
plt.subplot(322)
sns.barplot(x = 'fare', y = 'survived', data = data_5)
plt.subplot(323)
sns.barplot(x = 'pclass', y = 'survived', data = data_5)
plt.subplot(324)
sns.barplot(x = 'age', y = 'survived', data = data_5)
plt.subplot(325)
sns.barplot(x = 'sex', y = 'survived', data = data_5)
plt.subplot(326)
sns.barplot(x = 'embarked', y = 'survived', data = data_5);
Now, machine learning models hate non-numeric values; we cannot put them into our train and test data, so we need to convert them into numeric values. You have two options for that: scikit-learn’s LabelEncoder and Pandas’ get_dummies method. I won’t go into the details here; I am going to use get_dummies.
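For reference, here is a minimal sketch of the LabelEncoder option on the ‘sex’ column (we won’t use it below, it’s only here for comparison):
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# 'female'/'male' become 0/1; fine for a binary column, but for columns with
# more categories it imposes an artificial order, which get_dummies avoids
sex_encoded = encoder.fit_transform(data_5['sex'])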
dummies = ['fare', 'age', 'embarked', 'sex']
dummy_data = pd.get_dummies(data_5[dummies])
Let’s open it up a little. ‘dummies’ contains the column names we want to convert into numeric values. Every category in those columns becomes its own column, which holds 1 when that category applies to the passenger and 0 when it doesn’t.
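A tiny standalone example (made-up values) shows what that means:
example = pd.Series(['S', 'C', 'Q', 'S'], name='embarked')
print(pd.get_dummies(example))
# one column per category ('C', 'Q', 'S'); each row gets a 1 in the column
# matching its value and 0 everywhere else (True/False on newer pandas versions)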
dummy_data.shape
>> (891, 10)
We will concat the two data frames, and drop the old columns.
data_6 = pd.concat([data_5, dummy_data], axis = 1)
data_6.drop(dummies, axis=1, inplace=True)
Now we have 891 rows, 18 columns. We are ready to build our model.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
Since we imported necessary libraries for our model, we are good to go.
X = data_6.drop('survived', axis = 1)
y = data_6['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
# X contains the independent values, y contains the dependent value
Model building:
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
y_pred
y_pred looks like this:
array([0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1,
0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0,
1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1,
0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1])
We can check the accuracy score of our model.
accuracy_score(y_pred, y_test)
>> 0.8067796610169492
confusion_matrix(y_pred, y_test)
>> array([[158, 31],
[ 26, 80]])
# 31 + 26 = 57 wrong predictions
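To read the confusion matrix, here is a short sketch using the same y_pred and y_test from above. One note: the conventional argument order is confusion_matrix(y_test, y_pred); with the order swapped, as here, the matrix is simply transposed, and the diagonal still holds the correct predictions.
cm = confusion_matrix(y_pred, y_test)

correct = cm.trace()           # diagonal: predictions that match the truth (158 + 80)
wrong = cm.sum() - correct     # off-diagonal: 31 + 26 = 57 wrong predictions
print(correct, wrong)          # 238 57
print(correct / cm.sum())      # same number accuracy_score gave us above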
To be honest, it doesn’t look great. We can do better, and we could try different models on this dataset, but this article is about logistic regression. Like I said before, we cannot know up front whether logistic regression is the right choice for the Titanic dataset. Well, if you have been doing this for a long time, maybe you can. I can’t yet.
The good thing is, now you know logistic regression algorithm, and how to implement it!