Logistic Regression — Implementation from scratch

Parth Dhameliya
10 min read · Mar 30, 2020


Logistic Regression is a classification algorithm. It uses the logistic (sigmoid) function to classify examples into classes based on the given data. To learn about Linear Regression, you can also visit http://link.medium.com/o8drE5cyG4

Logistic Regression Decision boundary separating classes

Let's walk through the training process step by step with an example. Suppose we want to detect whether a subject has diabetes or not. Here we have two classes: the subject has diabetes (1) or does not (0).

Model Representation:

Input variables or features (X) : Glucose, BloodPressure, Insulin, Age, etc.

Output variable or target (Y) : 1 or 0 (diabetes [1] or no diabetes [0]). Here our target variable is "Outcome".

m : Number of training examples (rows)

n : Number of features (columns, excluding the target column). Here the number of features is 8: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age]

X(i) : the i-th training example. For example, X1(0) is 6 and X1(1) is 1; similarly for y(i).

Overall Matrix conversion of above dataset

Here matrix X contains the features, e.g. X1 = Pregnancies, X2 = Glucose, X3 = BloodPressure, and so on. X has dimension (m, n+1) and theta has dimension (n+1, 1), where n is the number of features. We include a column vector of ones (X0) so that the dimension of X becomes (m, n+1).

Θ is initially set to zero; we want to find values of theta such that H(Θ), i.e. the predicted y, has minimum error with respect to the actual y.

Matrix Y represents the target variable, i.e. Y = Outcome from the above dataset.

Hypothesis or prediction function: we will denote the prediction function as H(Θ).

Z = Θ0*X0(i) + Θ1*X1(i) + … + Θn*Xn(i)

H(Θ) = g(Θ0*X0(i) + Θ1*X1(i) + … + Θn*Xn(i))

H(Θ) = g(Z)

Here g(Z) is the sigmoid function, g(Z) = 1/(1 + e^(-Z)), which maps any real value to the range (0, 1).

If H(Θ) ≥ 0.5, we predict y = 1.

For example, H(Θ) = 0.7 means the subject has a 70% chance of having diabetes and a 30% chance of not having it.

If H(Θ) < 0.5, we predict y = 0.

For example, H(Θ) = 0.2 means the subject has a 20% chance of having diabetes and an 80% chance of not having it.
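As a quick illustration (a toy sketch with made-up probabilities, not part of the original walkthrough), the 0.5 threshold can be applied to a whole vector of predictions in one vectorized step:

import numpy as np

# hypothetical predicted probabilities for five subjects
h = np.array([0.7, 0.2, 0.55, 0.49, 0.9])

# apply the 0.5 decision threshold to get class labels
labels = np.where(h >= 0.5, 1, 0)
print(labels)   # [1 0 1 0 1]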

sigmoid function

Rather than computing Z = Θ0*X0(i) + Θ1*X1(i) + … + Θn*Xn(i) in a loop over the training examples, we are going to use vectorization, so the computation becomes:

Z = X*Θ

H(Θ) = g(Z)

The above image shows the matrix conversion of the diabetes dataset, where X contains all the feature examples and y is the target. Here X0(i) = 1.

Prediction Function(predicted y)
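To see what this vectorization buys us, here is a small sketch (with a made-up X and theta, not the diabetes data) comparing the per-example loop with the single matrix product X*Θ:

import numpy as np

# toy design matrix: 4 examples, a column of ones (X0) plus 2 features
X = np.array([[1., 6., 148.],
              [1., 1.,  85.],
              [1., 8., 183.],
              [1., 5., 121.]])
theta = np.array([[0.1], [0.2], [0.01]])

# loop version: compute Z for each training example separately
Z_loop = np.zeros((4, 1))
for i in range(4):
    Z_loop[i] = sum(theta[j, 0] * X[i, j] for j in range(3))

# vectorized version: one matrix multiplication
Z_vec = X @ theta

print(np.allclose(Z_loop, Z_vec))   # True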

We are going to find the values of theta using the gradient descent algorithm. Before diving into gradient descent, it is important to understand the cost function, as it is used inside gradient descent.

Cost function: to measure the error between the actual y and the predicted y (H(Θ)) we need an error metric, i.e. the cost function J(Θ).

J(Θ) = -(1/m) * (y.T * log(H(Θ)) + (1 − y).T * log(1 − H(Θ)))

T = Transpose of matrix

m : Number of Training examples (rows)

H(Θ) : predicted y
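As a sanity check of the formula (a toy example with made-up numbers, not the diabetes data), the vectorized cost can be computed directly with numpy:

import numpy as np

# toy example: 4 actual labels and 4 predicted probabilities
y = np.array([[1.], [0.], [1.], [0.]])
h = np.array([[0.9], [0.2], [0.7], [0.4]])
m = y.shape[0]

# J(Θ) = -(1/m) * (y.T * log(H(Θ)) + (1 - y).T * log(1 - H(Θ)))
J = (-1 / m) * (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h))
print(J)   # roughly [[0.30]]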

Gradient Descent: So we have our hypothesis or prediction function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters Θ in the hypothesis function. That’s where gradient descent comes in.

So, we are going to update Θ using the formula given below:

Θ = Θ - alpha*grad

alpha : Learning rate

grad : (1/m) * ((H(Θ)-y).T*(X)).T

Taking the partial derivative of J(Θ) with respect to Θ:

grad = ∂J(Θ)/∂Θ = (1/m) * ((H(Θ)-y).T*(X)).T

T = Transpose of matrix

We will repeat (Θ = Θ - alpha*grad) until convergence, i.e. until the cost J(Θ) reaches its minimum.

J(theta) vs theta

alpha or learning rate: alpha is a hyper-parameter called the learning rate. It should be neither too small nor too large. If alpha is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge. If alpha is too small, gradient descent can be slow. Usually, by trying different values of alpha such as [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3], we can get an idea of which one is appropriate.
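One way to compare candidate learning rates is to run a fixed number of gradient descent iterations for each alpha and look at the final cost. The sketch below does this on small synthetic data (not the diabetes set), with the sigmoid, gradient, and cost steps written inline:

import numpy as np

np.random.seed(0)
# synthetic data: 100 examples, a column of ones (X0) plus 2 features
X = np.hstack((np.ones((100, 1)), np.random.randn(100, 2)))
y = (X[:, 1:2] + X[:, 2:3] > 0).astype(float)
m = X.shape[0]

for alpha in [0.001, 0.01, 0.1, 1]:
    theta = np.zeros((3, 1))
    for _ in range(50):
        h = 1 / (1 + np.exp(-(X @ theta)))              # sigmoid of X*Θ
        theta = theta - alpha * (1 / m) * (X.T @ (h - y))
    h = np.clip(h, 1e-10, 1 - 1e-10)                    # guard against log(0)
    J = (-1 / m) * (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h))
    print("alpha:", alpha, "cost after 50 iterations:", J.item())

With a very small alpha the cost barely moves in 50 iterations, while a well-chosen alpha drives it down quickly.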

If we want to know whether the algorithm is working or not, we will simply plot J(Θ) vs no. of iterations.

J(theta) or cost decreasing per iteration

If cost is decreasing with the iterations then our gradient descent algorithm is working fine.

So far we have seen how logistic regression works: it applies a linear function and squashes the result to an activation between 0 and 1 using the sigmoid function.

For the practical coding part, we are going to implement logistic regression on the diabetes dataset to predict whether a subject has diabetes or not.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Here pandas is used for importing the csv file, numpy is used for the matrix algebra, and matplotlib is used for plotting the cost per iteration.

Now, with the help of the pandas read_csv function, we import the data:

data = pd.read_csv("D:\\diabetes.csv")
data.head()

Here data.head() displays the first 5 rows of the data.

Diabetes Patient Data

There are 9 columns and 768 rows, of which the Pregnancies to Age columns form the feature matrix X and the Outcome column is y (the one we want to predict).

Now we are going to split our data into features (X) and target (y).

X = data.drop(['Outcome'],axis=1)
X.head()

Here X contains all the features after dropping the Outcome column, which is the target.

y = data['Outcome']

Now we assign the Outcome column to y.

To check how many patients have diabetes and how many do not, we can visualize the target with sns.countplot:

import seaborn as sns
sns.countplot(x=y)

This produces the following plot:

It shows that, according to the data, approximately 500 patients do not have diabetes and approximately 250 do.

Now we are going to convert our X and y variables into matrices. Right now X and y are pandas DataFrames, so we need to convert them.

The following converts a DataFrame into a matrix:

X = np.matrix(X)
y = np.matrix(y).T
Output:
X matrix:
matrix([[ 6. , 148. , 72. , ..., 33.6 , 0.627, 50. ],
[ 1. , 85. , 66. , ..., 26.6 , 0.351, 31. ],
[ 8. , 183. , 64. , ..., 23.3 , 0.672, 32. ],
...,
[ 5. , 121. , 72. , ..., 26.2 , 0.245, 30. ],
[ 1. , 126. , 60. , ..., 30.1 , 0.349, 47. ],
[ 1. , 93. , 70. , ..., 30.4 , 0.315, 23. ]])
y matrix:
matrix([[1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1,0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,..........,0,1,0,0,1,0, 0, 0, 1, 0]], dtype=int64)

Now I am going to apply feature normalization to the feature matrix. Feature normalization speeds up gradient descent, so the computation time becomes smaller.

# Feature normalization
mean = np.mean(X,axis = 0)
std = np.std(X,axis = 0)
X = (X - mean)/std

The code above normalizes the features of the X matrix.

Now we are going to stack a column X0 of ones, which has dimension (m, 1), where m is the number of training examples.

m,n = X.shape
print("Dimension of X : ",X.shape)
print("Dimension of y : ",y.shape)

Output:
Dimension of X :  (768, 8)
Dimension of y :  (768, 1)

Here m is 768 and the number of features n is 8.

Now we add the X0 column to X:

X0 = np.ones((m,1))
X = np.hstack((X0,X))
Output:
X matrix:
matrix([[ 1. , 0.63994726, 0.84832379, ..., 0.20401277,
0.46849198, 1.4259954 ],
[ 1. , -0.84488505, -1.12339636, ..., -0.68442195,
-0.36506078, -0.19067191],
[ 1. , 1.23388019, 1.94372388, ..., -1.10325546,
0.60439732, -0.10558415],
...,
[ 1. , 0.3429808 , 0.00330087, ..., -0.73518964,
-0.68519336, -0.27575966],
[ 1. , -0.84488505, 0.1597866 , ..., -0.24020459,
-0.37110101, 1.17073215],
[ 1. , -0.84488505, -0.8730192 , ..., -0.20212881,
-0.47378505, -0.87137393]])

Now the shape of X will be (m, n+1), i.e. (768, 9).

Now let's create the Θ matrix of shape (n+1, 1):

theta=np.zeros((n+1,1))

Now we create the sigmoid function:

def sigmoid(Z):
    S = 1/(1 + np.exp(-Z))
    return S

Now we create our prediction function, i.e. H(Θ) = g(Z):

def prediction(X,theta):
    Z = X * theta
    htheta = sigmoid(Z)
    return htheta

Now we create a cost function to measure the error:

J(Θ) = -(1/m) * (y.T * log(H(Θ)) + (1 − y).T * log(1 − H(Θ)))

def cost(pred,y):
    J = (-1/m)*(y.T*np.log(pred) + (1 - y).T*np.log(1 - pred))
    return J

Now we create the gradient function used to update Θ:

grad = (1/m) * ((H(Θ)-y).T*(X)).T

def grad(X,prediction,y):
    d_theta = (1/m) * (X.T * (prediction - y))
    return d_theta

Now we create a LogisticRegression function that ties all of the above functions together:

def LogisticRegression(X,y,theta,alpha,epoch):

    j_history = []

    for i in range(epoch):

        pred = prediction(X,theta)
        J = cost(pred,y)
        g = grad(X,pred,y)
        theta = theta - alpha*g       # gradient descent update step
        j_history = np.append(j_history,J)

        print("Epoch : ",i," ","cost : ",J)

    x = np.linspace(0,epoch,epoch)
    plt.ylabel("cost function")
    plt.plot(x,j_history,color='r')
    plt.xlabel("No. of iterations")
    plt.title("Decreasing of cost function")

    return theta,pred

As you can see above, j_history stores the cost at each iteration so that we can visualize whether the cost is decreasing.

theta,pred = LogisticRegression(X,y,theta,alpha=2,epoch=30)

Output:
Epoch :  0   cost :  [[0.69314718]]
Epoch : 1 cost : [[0.53342096]]
Epoch : 2 cost : [[0.50414229]]
Epoch : 3 cost : [[0.49056751]]
Epoch : 4 cost : [[0.48328456]]
Epoch : 5 cost : [[0.47902958]]
Epoch : 6 cost : [[0.47639795]]
Epoch : 7 cost : [[0.47470381]]
Epoch : 8 cost : [[0.47358081]]
Epoch : 9 cost : [[0.47281983]]
Epoch : 10 cost : [[0.47229528]]
Epoch : 11 cost : [[0.47192879]]
Epoch : 12 cost : [[0.47166992]]
Epoch : 13 cost : [[0.47148541]]
Epoch : 14 cost : [[0.47135292]]
Epoch : 15 cost : [[0.47125718]]
Epoch : 16 cost : [[0.47118761]]
Epoch : 17 cost : [[0.47113683]]
Epoch : 18 cost : [[0.47109961]]
Epoch : 19 cost : [[0.47107223]]
Epoch : 20 cost : [[0.47105202]]
Epoch : 21 cost : [[0.47103706]]
Epoch : 22 cost : [[0.47102597]]
Epoch : 23 cost : [[0.47101772]]
Epoch : 24 cost : [[0.47101156]]
Epoch : 25 cost : [[0.47100697]]
Epoch : 26 cost : [[0.47100353]]
Epoch : 27 cost : [[0.47100096]]
Epoch : 28 cost : [[0.47099902]]
Epoch : 29 cost : [[0.47099757]]
J(theta) vs iterations

The block above calls the LogisticRegression function to fit the training data (X, y) and returns the theta and prediction values.

Now we are going to compute the accuracy of the model.

predict = prediction(X,theta)
print(predict)
for i in range(len(predict)):
    if predict[i] >= 0.50:
        predict[i] = 1
    else:
        predict[i] = 0
k = np.double(predict == y)
acc = np.mean(k)*100
print(acc)
Output:
78.25520833333334

We can also use the machine learning framework "sklearn".

As we can see, our model's accuracy is 78.255% and sklearn's model achieves 78.125%, which are close to each other.

Let me explain how the framework works. First, import LogisticRegression from sklearn.linear_model.

Create an object of its class.

clf = LogisticRegression(max_iter=1000,random_state=0).fit(X, y) trains on the data, like the LogisticRegression(X,y,theta,alpha=2,epoch=30) function we created.

clf.predict() is used to make predictions.

clf.score(X,y) gives the accuracy of the predicted y against the actual y.
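Putting those pieces together, a minimal sklearn version could look like the sketch below (assuming the same csv path used earlier; since sklearn fits its own intercept, we pass the raw feature columns without the extra X0 column of ones):

from sklearn.linear_model import LogisticRegression
import pandas as pd

# reload the data so sklearn gets plain feature columns and labels
data = pd.read_csv("D:\\diabetes.csv")
X_sk = data.drop(['Outcome'], axis=1)
y_sk = data['Outcome']

# fit the model as described above
clf = LogisticRegression(max_iter=1000, random_state=0).fit(X_sk, y_sk)

print(clf.predict(X_sk[:5]))   # predicted classes for the first 5 rows
print(clf.score(X_sk, y_sk))   # training accuracy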

For more information to use it you can visit https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Now we can conclude with a simple process diagram as follows:

