Logistic Regression

Logistic Regression Implementation in Python

Harshita Yadav
Machine Learning with Python

--

In the last blog, we learned about Simple and Multiple Linear Regression and their implementation in Python.

In this blog, we will learn about Logistic Regression and its implementation in Python.

Logistic Regression

Logistic regression is a supervised learning technique. It is a classification algorithm used to predict discrete values such as 0 or 1, Malignant or Benign, Spam or Not Spam, etc.

Logistic regression is based on the concept of probability. It uses the logistic function, also known as the Sigmoid function, whose output is always bounded between 0 and 1. We use this Sigmoid function to map the predicted values to probabilities.

Example: Suppose we have two classes, dog and cat, and we assign 1 to dogs and 0 to cats. In logistic regression, we set a threshold value: predictions above the threshold are classified as class 1, i.e., dogs, and predictions below it as class 0, i.e., cats.
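
As a quick illustration (a minimal NumPy sketch with made-up scores, separate from the Titanic example below), the sigmoid squashes raw model outputs into probabilities, and a 0.5 threshold turns those probabilities into class labels:

import numpy as np

def sigmoid(z):
    #Maps any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

scores = np.array([-2.0, 0.0, 1.5])  #illustrative raw model outputs
probs = sigmoid(scores)              #approx. [0.12, 0.50, 0.82]
labels = (probs >= 0.5).astype(int)  #threshold at 0.5: 1 = dog, 0 = cat
print(probs, labels)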


Logistic Regression Implementation in Python

Problem statement: The aim is to predict the survival outcome of the Titanic passengers.

Since this is a binary classification, logistic regression can be used to build the model.

Dataset source: https://www.kaggle.com/c/titanic/data

Importing the Libraries

#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

numpy: NumPy stands for Numerical Python, a Python package for the computation and processing of single- and multi-dimensional array elements.

pandas: Pandas provides high-performance data manipulation tools for Python.

matplotlib: Matplotlib is a library used for data visualization. It is mainly used for basic plotting. Visualization using Matplotlib generally consists of bars, pies, lines, scatter plots, and so on.

seaborn: Seaborn is a library used for making statistical graphics of the dataset. It provides a variety of visualization patterns, requires less syntax, and comes with attractive default themes. It is used to summarize data visually and show the data’s distribution.

Reading the Dataset

#Reading the dataset
dataset = pd.read_csv("titanic.csv")

The dataset is in the CSV (Comma-Separated Values) format. Hence, we use pd.read_csv() to read the dataset.

dataset.head()
Titanic Dataset

Dataset Column Description

  • PassengerId: PassengerId is the Id given to all the passengers to identify each individual uniquely.
  • Survived: Survived indicates whether the passenger survived or not (0 for not survived and 1 for survived).
  • Pclass: Passenger class indicates the class a passenger belongs to (1 for 1st class, 2 for 2nd class, and 3 for 3rd class).
  • Name: Name is the name of the passenger.
  • Sex: Sex indicates the gender of the passenger.
  • Age: Age indicates the age of the passenger.
  • SibSp: SibSp indicates the number of siblings/spouses aboard.
  • Parch: Parch indicates the number of parents/children aboard.
  • Ticket: Ticket indicates the ticket number.
  • Fare: Fare is the passenger fare in pounds.
  • Cabin: The cabin indicates the cabin number.
  • Embarked: Embarked indicates port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

Data Pre-Processing

  1. Checking for missing values in the dataset
#Checking for missing values
dataset.isnull().sum()

The columns Age, Cabin, and Embarked have missing values.

Imputation of missing values

#Median of the Age column
print('Median of Age column: %.2f' % (dataset["Age"].median(skipna = True)))
#Percentage of missing records in the Cabin column
print('Percent of missing records in the Cabin column: %.2f%%' %((dataset['Cabin'].isnull().sum()/dataset.shape[0])*100))
#Most common boarding port of embarkation
print('Most common boarding port of embarkation: %s' %dataset['Embarked'].value_counts().idxmax())

We’ll fill the missing values of the Age column with the median of the Age column.

We’ll drop the Cabin column, as 77.10% of its records are missing.

We’ll fill the missing values of the Embarked column with the most common port of embarkation, i.e., S, which indicates Southampton.

#Filling Age column by median
dataset["Age"].fillna(dataset["Age"].median(skipna=True), inplace=True)
#Filling Embarked column by the most common port of embarkation
dataset["Embarked"].fillna(dataset['Embarked'].value_counts().idxmax(), inplace=True)
#Dropping the Cabin column
dataset.drop('Cabin', axis=1, inplace=True)

Checking missing values after imputation,

#Checking for missing values
dataset.isnull().sum()

Now, there are no missing values present in the dataset.

2. Dropping unnecessary columns

#Dropping unnecessary columns
dataset.drop('PassengerId', axis=1, inplace=True)
dataset.drop('Name', axis=1, inplace=True)
dataset.drop('Ticket', axis=1, inplace=True)

The columns PassengerId, Name, and Ticket are unnecessary, as they do not affect the target variable, i.e., Survived. Therefore, we can drop them from the dataset.

3. Combining several related variables and creating a single variable

SibSp and Parch relate to traveling with family. For simplicity’s sake and to account for possible multicollinearity, we can combine the effect of these variables into one predictor variable, i.e., TravelAlone, which will indicate whether or not that individual was traveling alone.

#Creating variable TravelAlone
dataset['TravelAlone'] = np.where((dataset["SibSp"] + dataset["Parch"]) > 0, 0, 1)
dataset.drop('SibSp', axis=1, inplace=True)
dataset.drop('Parch', axis=1, inplace=True)

Final columns after pre-processing

dataset.head()
Titanic Dataset

Exploratory Data Analysis

  1. Dataset shape
#Number of rows and columns of the dataset
dataset.shape

There are 891 rows and 7 columns in the dataset.

2. Dataset info

#Dataset info
dataset.info()

The data types of the columns are integer, float, and object.

3. Dataset description

#Dataset description
dataset.describe()

4. Analysis of Sex feature

#Count of passengers based on gender
sns.countplot(x='Sex', data=dataset)
dataset['Sex'].value_counts()

The above graph shows the count of passengers grouped by gender. There are 577 male passengers and 314 female passengers, so there were more males than females on board.

#Percentage of passengers survived grouped by gender
sns.barplot(x='Sex', y='Survived', data=dataset)
dataset.groupby('Sex',as_index=False).Survived.mean()

The above graph shows the effect of gender on the survival rate of the passengers. Far more females survived than males: about 74% of the females survived, while only about 18% of the males did.

#Count of passengers survived based on gender
sns.countplot(x='Survived', hue='Sex', data=dataset)

The above graph shows that among the passengers who did not survive, most of them were males. And among the passengers who survived, most of them were females.

5. Analysis of Pclass feature

#Count of passengers based on Pclass
sns.countplot(x='Pclass', data=dataset)
dataset['Pclass'].value_counts()

The above graph shows that the 3rd class has the maximum number of passengers (491), followed by the 1st class (216), while the 2nd class has the minimum (184).

#Percentage of passengers survived grouped by Pclass
sns.barplot(x='Pclass', y='Survived', data=dataset)
dataset.groupby('Pclass',as_index=False).Survived.mean()

The above graph shows that the survival rate of 1st-class passengers is the highest, i.e., 62.96%, while that of 3rd-class passengers is the lowest, i.e., 24.23%.

#Count of passengers survived based on Pclass
sns.countplot(x='Survived', hue='Pclass', data=dataset)

It is clear from the above graph that among the passengers who did not survive, most belonged to the 3rd class, while among the passengers who survived, most belonged to the 1st class.

6. Analysis of Embarked feature

#Count of the passengers based on Embarked
sns.countplot(x='Embarked', data=dataset)
dataset['Embarked'].value_counts()

The above graph shows that the number of passengers who embarked at Southampton is 646, the maximum; at Cherbourg, 168; and at Queenstown, 77, the minimum.

#Percentage of passengers survived grouped by port of embarkation
sns.barplot(x='Embarked', y='Survived', data=dataset)
dataset.groupby('Embarked',as_index=False).Survived.mean()

The above graph shows that the passengers who embarked at Cherbourg had the highest survival rate, at 55.35%.

#Count of passengers survived based on port of embarkation
sns.countplot(x='Survived', hue='Embarked', data=dataset)

It is clear from the above graph that most of the passengers who did not survive embarked from Southampton, and among the passengers who survived, most also embarked from Southampton.

7. Analysis of TravelAlone feature

#Count of passengers based on TravelAlone
sns.countplot(x='TravelAlone', data=dataset)
dataset['TravelAlone'].value_counts()

The above graph shows that 537 passengers were traveling alone and 354 passengers were traveling with family.

#Percentage of passengers survived grouped by TravelAlone
sns.barplot(x='TravelAlone', y='Survived', data=dataset)
dataset.groupby('TravelAlone',as_index=False).Survived.mean()

The above graph shows that passengers traveling alone were less likely to survive than passengers traveling with family.

8. Analysis of Age feature

#Age Distribution
dataset.Age.hist()
print("The Median age of passengers is:", int(dataset.Age.median()))

The above histogram shows the age distribution. The passengers’ ages range from 0 to 80 years, and the distribution is approximately normal, centered around the median age of 28.

#Age group which is more likely to survive
sns.lmplot(x='Age', y='Survived', data=dataset)

It is clear from the above graph that younger individuals were more likely to survive: the regression line shows a negative correlation, so an increase in age corresponds to a lower chance of survival.

9. Analysis of Survived feature

#Count of the passengers survived
sns.countplot(x='Survived', data=dataset)
dataset['Survived'].value_counts()

The above graph shows that, in the given dataset, 549 passengers did not survive and 342 passengers survived.

10. Correlation Matrix

dataset.corr()

The above matrix shows the correlation among the variables.
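
Since seaborn and matplotlib are already imported, the same matrix can also be visualized as a heatmap (an optional sketch; the numeric_only=True argument, available in newer pandas versions, skips the string columns):

#Visualizing the correlation matrix as a heatmap
sns.heatmap(dataset.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()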

Model Building

Before building the model, we need to perform label encoding for the categorical variables because categorical data must be encoded into numbers before using it to fit and evaluate a model.

Label encoding

#Import label encoder
from sklearn import preprocessing

#label_encoder object knows how to understand word labels
label_encoder = preprocessing.LabelEncoder()

#Encode labels in column Sex and Embarked
dataset['Sex']= label_encoder.fit_transform(dataset['Sex'])
dataset['Embarked']=label_encoder.fit_transform(dataset['Embarked'])

sklearn.preprocessing: It provides several common utility functions and transformer classes to change raw feature vectors into a representation more suitable for the downstream estimators.

LabelEncoder(): It is used to transform non-numerical labels into numerical labels.

fit_transform(): It is used to fit the label encoder, and it returns the encoded labels.
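
To see exactly which number each category received, you can inspect the encoder’s classes_ attribute (a quick illustrative check; LabelEncoder assigns codes to classes in sorted order):

#The encoder keeps the classes of its most recent fit (Embarked here)
print(label_encoder.classes_)  #['C' 'Q' 'S'] -> encoded as 0, 1, 2
#For the Sex column, 'female' was mapped to 0 and 'male' to 1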

dataset.head()

Setting the values for the independent variables (X) and the dependent variable (y)

#Setting the values for the dependent and independent variables
X = dataset.drop('Survived', axis=1)
y = dataset.Survived

Splitting the dataset into train and test set

#Splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

from sklearn.model_selection import train_test_split: It is used for splitting data arrays into two subsets: for training data and testing data. With this function, you don’t need to divide the dataset manually.

We need to split our dataset into training and testing sets. We’ll perform this by importing train_test_split from the sklearn.model_selection library. It is usually good to keep around 70% of the data in the train set and the remaining 30% in the test set, which is what we do here.

test_size: This parameter specifies the proportion of the dataset to include in the test split. If neither test_size nor train_size is specified, it defaults to 0.25.

random_state: This parameter controls the shuffling applied to the data before the split is applied. Pass an int for reproducible output across multiple function calls.
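
As a quick sanity check on the split (assuming the 70/30 split on the 891-row dataset above), we can print the shapes of the resulting sets:

#Verifying the shapes of the train and test sets
print(X_train.shape, X_test.shape)  #expected: (623, 6) (268, 6)
print(y_train.shape, y_test.shape)  #expected: (623,) (268,)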

Implementing the Logistic Regression Model

#Fitting the Logistic Regression model
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

from sklearn.linear_model import LogisticRegression: It is used to perform Logistic Regression in Python.

To build a logistic regression model, we create an instance of the LogisticRegression() class and train it on X_train and y_train using the fit() method of that class. The variable lr_model now holds a fitted instance of the LogisticRegression() class.

Prediction on the test set

#Prediction on the test set
y_pred = lr_model.predict(X_test)
#Predicted values
y_pred

Once we have fitted (trained) the model, we can make predictions using the predict() method. We pass X_test to this method and compare the predicted values, y_pred, with the y_test values to check how accurate our predictions are.

Actual values and the predicted values

#Actual value and the predicted value
a = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred})
a.head()
Actual and Predicted values

Evaluating the Model

#Confusion matrix and classification report
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(matrix, annot=True, fmt="d")
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
print(classification_report(y_test, y_pred))

metrics: It consists of functions used to evaluate machine learning algorithms in Python.

confusion_matrix(): It produces a table that is used to describe the performance of a classification model on a set of test data for which the true values are known (see the sketch after these definitions).

classification_report(): It is used to measure the quality of predictions from a classification algorithm.
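
In scikit-learn’s convention, the rows of the confusion matrix are the true classes and the columns are the predicted classes, so for a binary problem the four cells can be unpacked as follows (a small sketch using the y_test and y_pred from above):

#Unpacking the confusion matrix: rows = true labels, columns = predictions
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)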

Classification Report and Confusion Matrix

Accuracy: Accuracy represents the number of correctly classified data instances over the total number of data instances. The accuracy obtained from the classification report is 0.79, which indicates that the accuracy of the model is 79%.

Precision: It is the number of correct positive results divided by the number of positive results predicted by the classifier. The precision obtained from the classification report is 0.79, which indicates that the precision of the model is 79%.

Recall: Recall gives a measure of how accurately our model can identify the relevant data. The recall value obtained from the classification report is 0.87, which indicates that the model can identify 87% of the relevant data.

f1-score: The F1-score is the harmonic mean of precision and recall, combining both into a single measure. The F1-score obtained from the classification report is 0.83.
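
All four scores follow directly from the confusion-matrix cells. Below is a minimal sketch, assuming the tn, fp, fn, tp values unpacked above, that computes them for the positive (Survived = 1) class; the classification report prints these per class along with their averages:

#Computing the metrics by hand from the confusion-matrix cells
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  #of all predicted positives, how many were correct
recall = tp / (tp + fn)     #of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  #harmonic mean of the two
print('Accuracy: %.2f Precision: %.2f Recall: %.2f F1: %.2f' % (accuracy, precision, recall, f1))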

Conclusion

There were 891 records in the dataset, of which 70% were used for training the model and 30%, i.e., 268 records, for testing. Out of those 268 records, 57 were misclassified.

Hey guys! I’m Harshita. I’m a Data Science student and trying to contribute a bit to the community by sharing my knowledge. Please share this with someone you know who is trying to learn Machine Learning. I would appreciate your comments, suggestions, or feedback. Thank you.

Email Id: harshita.1128@gmail.com

LinkedIn: www.linkedin.com/in/harshita-11

Github: www.github.com/Harshita0109
