Logistic Regression

Dasaprakash K
7 min read · Mar 10, 2018


Introduction

While Linear Regression predicts a continuous output, Logistic Regression deals with discrete outputs, i.e. classification problems. A discrete variable has a countable set of values.

In Linear Regression we would predict the dollar amount for which a house can be sold, whereas in Logistic Regression we would predict whether the house will be sold or not based on the input variables.

There are two types of classification:

  1. Binary Classification (Titanic dataset from Kaggle)
  2. Multi-class classification (MNIST Dataset from Kaggle)

Logistic Function — Sigmoid

In Logistic Regression, we first apply a linear approximation, just as we did in Linear Regression:

f = Wx + b

And then we apply a logistic function, such as the sigmoid function, to perform binary classification.

σ(f) turns the output of the linear approximation “f” defined above into a probability, so its values always lie between 0 and 1.

Sigmoid Graph

As we see the graph, we can easily spot the difference between linear approximation function and logistic function. f was linear whereas sigmoid is a non-linear function.
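To reproduce the sigmoid curve above, here is a minimal standalone sketch (assuming only numpy and matplotlib, which the rest of the post already uses):

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)                    # range of linear scores f = Wx + b
sigma = 1 / (1 + np.exp(-z))                     # sigmoid squashes the score into (0, 1)
plt.plot(z, sigma)
plt.axhline(0.5, color='gray', linestyle='--')   # the 0.5 decision threshold
plt.xlabel('f')
plt.ylabel('sigmoid(f)')
plt.show()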

# sigmoid(z) = 1 / (1 + e^(-z))
def sigmoid(self, X, theta):
    # linear score f = X·theta, squashed into (0, 1)
    z = np.dot(X, theta.T)
    return 1 / (1 + np.exp(-z))

Decision boundary

The sigmoid function returns the probability between 0 and 1.

For f = 0, sigmoid will return 0.5

For f = 1, sigmoid will return 0.73

For f = -1, sigmoid will return 0.27

To turn these probabilities into discrete class labels, we decide on a threshold, 0.5 in this case, above which we classify a score as admitted and below which as not admitted.

s = sigmoid(x)

s ≥ 0.5; admitted = 1

s < 0.5; admitted = 0
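Putting the threshold together with the sigmoid, a minimal sketch of the prediction step (the helper name predict and the default threshold are illustrative assumptions):

import numpy as np

def predict(X, theta, threshold=0.5):
    # probability from the sigmoid of the linear score
    s = 1 / (1 + np.exp(-np.dot(X, theta.T)))
    # classify as admitted (1) when the probability reaches the threshold
    return (s >= threshold).astype(int)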

Data

We will use a dataset from Coursera ML course by Andrew Ng. The dataset contains the features Exam 1 score and Exam 2 score. Based on the features we have the label of whether the student is admitted to the university or not.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plotData(X, Y):
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=100, alpha=0.5, cmap='coolwarm')
    plt.show()
def normalize(X):
    # standardize every column except the last one (the bias column)
    for i in range(X.shape[1]-1):
        X[:, i] = (X[:, i] - np.mean(X[:, i]))/np.std(X[:, i])
    return X
df = pd.read_csv('Logistic_Regression.txt', sep=',', header=None)
df.insert(2, 'Bias', np.ones(df.shape[0]))   # add a bias column of ones
X = np.array(df.iloc[:, :-1])                # Exam 1 score, Exam 2 score, Bias
Y = np.array(df.iloc[:, -1])                 # admitted (1) or not (0)
plotData(X, Y)
X = normalize(X)
Scatter plot of dataset

Cost Function

Since we are dealing with the sigmoid function, which is non-linear, we cannot reuse the squared-error cost function from Linear Regression here. We will use another cost function called Cross-Entropy, or log loss. Squaring the sigmoid predictions would produce a non-convex cost surface with many local minima; gradient descent may then settle into a local minimum instead of the global one, and learning would be too slow.

Logistic Cost Function:

J = -log(P) if Y = 1

J = -log(1-P) if Y = 0

Let’s analyze the cost function with some inputs:

For Y_Label = 1, Y_Pred = 0.9; J ≈ 0.11

For Y_Label = 1, Y_Pred = 0.5; J ≈ 0.69

For Y_Label = 1, Y_Pred = 0.15; J ≈ 1.90

Based on the above values, we can observe that the cross-entropy penalty for wrong predictions is far larger than for correct ones. The plots of the logistic cost function from Andrew Ng's course illustrate this behaviour graphically.

We can represent the above cost function as a single formula, shown below, which covers both cases because one of the two terms cancels out depending on whether the label is 0 or 1.

# cross entropy: J = -Σ(Y·log(P) + (1-Y)·log(1-P)) / N
def cost_function(self, Y, P):
    l1 = np.log(P)
    l2 = np.log(1 - P)
    return np.sum(-Y*l1 - (1-Y)*l2)/Y.size
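As a quick sanity check of this formula outside the class (a standalone sketch using the example predictions from above):

import numpy as np

Y = np.array([1, 1, 1])
P = np.array([0.9, 0.5, 0.15])
J = -Y*np.log(P) - (1 - Y)*np.log(1 - P)   # per-example cross-entropy
print(J)                                   # approximately [0.11, 0.69, 1.90]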

Gradient Descent

So far we have seen the sigmoid function, the cross-entropy cost for this non-linear function, and how that cost speeds up learning by penalising wrong predictions more heavily.

We will derive the derivatives of the cross-entropy function below and apply them in gradient descent.

Cost Function: J = -Σ(Ylog(P) + (1-Y)log(1-P))/N

∂J/∂Wi = ∂J/∂P*∂P/∂z*∂z/∂Wi

∂J/∂P = -Y/P + (1-Y)/(1-P)

∂P/∂z = P(1-P)

∂z/∂Wi = Xi

After simplification:

∂J/∂Wi = (P-Y)Xi

This operation can be vectorized over all the weights and training examples as:

∂J/∂W = X.T·(P-Y)/N

∂J/∂b = Σ(P-Y)/N

These derivatives of the cost with respect to W and b are what gradient descent uses to optimize J.
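Before plugging them in, here is a quick numerical sanity check of the derivation (a standalone sketch; the toy data and helper names are assumptions):

import numpy as np

def cost(X, Y, W):
    P = 1 / (1 + np.exp(-X.dot(W)))
    return -np.mean(Y*np.log(P) + (1 - Y)*np.log(1 - P))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # toy inputs with the bias folded into X
Y = np.array([0, 1, 1, 0, 1])
W = rng.normal(size=3)

P = 1 / (1 + np.exp(-X.dot(W)))
analytic = X.T.dot(P - Y) / Y.size     # ∂J/∂W = X.T·(P-Y)/N

eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.size):
    step = np.zeros_like(W)
    step[i] = eps
    numeric[i] = (cost(X, Y, W + step) - cost(X, Y, W - step)) / (2*eps)
print(np.allclose(analytic, numeric))  # True: the analytical gradient matches

With the derivation confirmed, the weights and bias are updated iteratively: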

loop until convergence {

W = W - η·∂J/∂W

b = b - η·∂J/∂b

} (update W & b simultaneously)

η — learning rate to update weights. We will see more about choosing learning rates in Neural Networks.

Remember, too large a learning rate might overshoot the minimum and miss convergence, while too small a learning rate will make the model learn slowly.

def gradient_descent(self, X, Y, theta, alpha, num_iter):
    J = []
    for i in range(num_iter):
        P = self.sigmoid(X, theta)
        # update the weights with the averaged gradient X.T·(P-Y)/N
        theta = theta - alpha * np.dot(X.T, (P-Y))/Y.size
        if i % 100 == 0:
            cost = self.cost_function(Y, P)
            rate = self.score(Y, P)
            print('Iteration: ' + str(i), 'Cost: ' + str(cost), 'Accuracy: ' + str(rate))
            J.append(cost)
    return theta, J

Training

The training of Logistic Regression is similar to Linear Regression.

  1. Initialize the weights (W, b)
  2. Initialize the learning rate (η) hyperparameter
  3. Set the number of iterations
  4. Normalize the input features
  5. Update the weights and bias using Gradient Descent until the cost is minimized
  6. Return the updated weights and the cost history
model = Logistic_Regression()
model.plotData(X, Y)
X = normalize(X)
theta = np.zeros(3)   # one weight per column: Exam 1, Exam 2, Bias
alpha = 0.01
num_iter = 400
theta, J = model.fit(X, Y, theta, alpha, num_iter)
P = model.predict(X)
model.plot_Decision_Boundary(X, Y)
Decision Boundary
def plot_Decision_Boundary(self, X, Y):
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=100, alpha=0.5, cmap='coolwarm')
    plt.xlabel('Exam 1 score')
    plt.ylabel('Exam 2 score')
    # boundary where theta[0]*x1 + theta[1]*x2 + theta[2] = 0
    x = np.linspace(-2, 2, 100)
    y = -(self.theta[0] * x + self.theta[2]) / self.theta[1]
    plt.plot(x, y, 'r')
    plt.show()

Model Evaluation

The model evaluation can be done with the following steps:

  1. Round the predicted probabilities from the sigmoid function
  2. Subtract from 1 the mean of the absolute differences between the labels and the rounded predictions
def score(self, Y, P):
    # accuracy: fraction of rounded predictions that match the labels
    return 1 - np.mean(np.abs(np.round(P) - Y))
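For binary labels this score is just the ordinary accuracy; a small standalone check with assumed toy values:

import numpy as np

Y = np.array([1, 0, 1, 1])
P = np.array([0.8, 0.4, 0.3, 0.9])               # hypothetical sigmoid outputs
print(1 - np.mean(np.abs(np.round(P) - Y)))      # 0.75
print(np.mean(np.round(P) == Y))                 # 0.75, the plain accuracy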

Logistic Regression with Regularization

Let’s use another dataset from Coursera ML course by Andrew Ng. The dataset contains test results for some microchips on two different tests. The model should help us determine whether the microchips should be accepted or rejected.

def plotData(self, X, Y):
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=118, alpha=0.5, cmap='coolwarm')
    plt.xlabel('Microchip Test 1')
    plt.ylabel('Microchip Test 2')
    plt.show()

A linear fit for this data set will result in high bias or underfitting.

Underfitting (High Bias)
from sklearn.preprocessing import PolynomialFeatures

def mapfeature(self, X, order):
    # expand the two inputs into all polynomial terms up to the given order
    poly = PolynomialFeatures(order)
    return poly.fit_transform(X)
def plotDecisionBoundary(self, X, Y, W, b, order):
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=118, alpha=0.5, cmap='coolwarm')
    plt.xlabel('Microchip Test 1')
    plt.ylabel('Microchip Test 2')
    dim = np.linspace(-1, 1.5, 1000)
    x, y = np.meshgrid(dim, dim)
    poly = self.mapfeature(np.column_stack((x.flatten(), y.flatten())), order)
    # the decision boundary is the contour where the linear score is zero
    z = (np.dot(poly, W) + b).reshape(1000, 1000)
    plt.contour(x, y, z, levels=[0], colors=['r'])
    plt.show()
model.plotDecisionBoundary(X, Y, W, b, order=1)

By adding extra higher-order polynomial features, a better fit can be achieved, but we may end up with high variance, or overfitting. To avoid this, let's review a concept called regularization, which lets us keep all the features while reducing the magnitude of the parameters Wj. Regularization prevents the learning algorithm from overfitting the training data or picking arbitrarily large parameter values.

The two types of regularization are l2 (Ridge) and l1 (Lasso). The main difference is that l2 shrinks the weights of irrelevant features towards zero, minimizing their impact on the trained model, while l1 can drive weights exactly to zero, effectively removing irrelevant features so the model ends up using only a few of them.

We will improve the model by adding polynomial features of the inputs up to order 6 and applying l2 regularization; the sketch below shows what the expansion produces.
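A small sketch of the feature expansion (mapfeature above wraps sklearn's PolynomialFeatures, so the feature count below follows from that assumption):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0.5, -0.2]])           # one sample with the two test scores
X_poly = PolynomialFeatures(6).fit_transform(X)
print(X_poly.shape)                   # (1, 28): all terms x1^i * x2^j with i + j <= 6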

Cost Function with Regularization

The cost function with l2 regularization has an additional term: the regularization parameter lambda (λ) times the sum of squares of the weights:

J = -Σ(Ylog(P) + (1-Y)log(1-P))/N + λΣW²/(2N)

Be sure to use the derivative of this regularized cost in gradient descent. With l1, the absolute values of the weights are used instead of their squares.

Note: the bias term should be excluded from regularization.

def cost_function(self, Y, P, lambda_rate, W):
    l1 = np.log(P)
    l2 = np.log(1-P)
    # l2 penalty: λ·ΣW²/(2N)
    reg = lambda_rate*np.sum(W**2)/(2*Y.size)
    return -np.sum(Y*l1 + (1-Y)*l2)/Y.size + reg
def gradient_descent(self, X, Y, W, b, alpha, lambda_rate, epochs):
    J = []
    for i in range(epochs):
        P = self.sigmoid(X, W, b)
        # shrink the weights first (the derivative of the l2 term) ...
        W = W * (1 - lambda_rate*alpha/Y.size)
        # ... then apply the averaged data gradient X.T·(P-Y)/N
        W = W - alpha * np.dot(X.T, P-Y)/Y.size
        b = b - alpha * np.sum(P-Y)/Y.size
        if i % 100 == 0:
            cost = self.cost_function(Y, P, lambda_rate, W)
            J.append(cost)
            rate = self.score(Y, P)
            print('Epoch: ' + str(i), 'Cost: ' + str(cost), 'Accuracy: ' + str(rate))
    return W, b, J
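A hypothetical end-to-end sketch of how these pieces fit together; the file name, the hyperparameter values, and the assembly of the methods into a Logistic_Regression class are assumptions for illustration:

import numpy as np
import pandas as pd

# microchip dataset: Test 1 score, Test 2 score, accepted (1) / rejected (0)
df = pd.read_csv('Microchips.txt', sep=',', header=None)   # hypothetical file name
X = np.array(df.iloc[:, :-1])
Y = np.array(df.iloc[:, -1])

model = Logistic_Regression()            # class collecting the methods above
X_poly = model.mapfeature(X, order=6)    # 28 polynomial features
W = np.zeros(X_poly.shape[1])            # one weight per polynomial feature
b = 0.0
W, b, J = model.gradient_descent(X_poly, Y, W, b, alpha=0.1, lambda_rate=1.0, epochs=1000)
model.plotDecisionBoundary(X, Y, W, b, order=6)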
Classification with Regularization
Classification with over-regularization

By adding higher-order polynomial features and a proper regularization rate, the classification model performs with an accuracy of 83%.

I will publish my next blog on Artificial Neural Networks, where multi-class classification of the MNIST dataset using Logistic Regression and an ANN will be compared.
