Python Implementation of Andrew Ng’s Machine Learning Course (Part 2.1)
In my previous post we discussed the Python implementation of Linear Regression with single and multiple independent variables, as part of the week 1 and week 2 programming assignments. Now we will move on to the week 3 content, i.e., Logistic Regression.
Since this is going to be a pretty lengthy post, I am dividing it into two parts. Watch out for Part 2.2, which looks into how to combat the overfitting problem.
If you are new here, I would encourage you to read my previous post:
Python Implementation of Andrew Ng’s Machine Learning Course (Part 1)
Pre-requisites
It’s highly recommended that you watch the week 3 video lectures first.
You should also have basic familiarity with the Python ecosystem.
Here we will look into one of the most widely used ML algorithms in the industry.
Logistic Regression
In this part of the exercise, you will build a logistic regression model to predict whether a student gets admitted into a university.
Problem context
Suppose that you are the administrator of a university department and you want to determine each applicant’s chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant’s scores on two exams and the admissions decision.
Your task is to build a classification model that estimates an applicant’s probability of admission based on the scores from those two exams.
First let’s load the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt # more on this later
Next, we read the data (the file ex2data1.txt is available under the week 3 content).
data = pd.read_csv('ex2data1.txt', header = None)
X = data.iloc[:,:-1]
y = data.iloc[:,2]
data.head()
So we have two independent features and one dependent variable. Here 0 means the candidate did not get admission and 1 means the candidate did.
Visualizing the data
Before starting to implement any learning algorithm, it is always good to visualize the data if possible.
mask = y == 1
adm = plt.scatter(X[mask][0].values, X[mask][1].values)
not_adm = plt.scatter(X[~mask][0].values, X[~mask][1].values)
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend((adm, not_adm), ('Admitted', 'Not admitted'))
plt.show()
Implementation
Before you start with the actual cost function, recall that the logistic regression hypothesis makes use of the sigmoid function. Let’s define our sigmoid function.
Sigmoid Function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
Note that we are writing vectorized code here, so it really doesn’t matter whether x is a scalar, a vector, a matrix or a tensor ;-). Writing and understanding vectorized code takes some mind bending at first (which anyone becomes good at with practice), but it gets rid of for loops and makes for efficient, generalized code.
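As a quick illustration, the same sigmoid works element-wise whether you hand it a scalar, a vector or a matrix:

print(sigmoid(0))                        # 0.5
print(sigmoid(np.array([-10, 0, 10])))   # approximately [0.0000454, 0.5, 0.9999546]
print(sigmoid(np.zeros((2, 2))))         # a 2x2 matrix filled with 0.5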
Cost Function
Let’s implement the cost function for logistic regression.
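For reference, the cost that the code below implements is the standard (unregularized) logistic regression cost from the lectures, which in LaTeX notation reads:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[\, y^{(i)} \log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big], \qquad h_\theta(x) = \mathrm{sigmoid}(\theta^{T} x)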
def costFunction(theta, X, y):
    m = len(y)   # number of training examples
    J = (-1/m) * np.sum(np.multiply(y, np.log(sigmoid(X @ theta)))
                        + np.multiply((1 - y), np.log(1 - sigmoid(X @ theta))))
    return J
Note that we have used the sigmoid function in the costFunction above.
There are multiple ways to code the cost function. What’s more important is the underlying mathematical idea and our ability to translate it into code.
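For instance, an equivalent, purely matrix-product version of the same cost (just a sketch, assuming y and theta are column vectors of shape (m, 1) and (n+1, 1), as set up further below) could look like this:

def costFunctionAlt(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)
    return ((-1/m) * (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h))).item()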
Gradient Function
def gradient(theta, X, y):
    m = len(y)   # number of training examples
    return (1/m) * X.T @ (sigmoid(X @ theta) - y)
Note that while this gradient looks identical to the linear regression gradient, the formula is actually different because linear and logistic regression use different hypothesis functions.
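Written out, the gradient the function above computes is

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)}

which looks the same as in linear regression, except that here h_\theta(x) = \mathrm{sigmoid}(\theta^{T} x) rather than \theta^{T} x.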
Let’s call these functions using the initial parameters.
(m, n) = X.shape
X = np.hstack((np.ones((m, 1)), X))   # add the intercept column of ones
y = y.values[:, np.newaxis]           # reshape y into an (m, 1) column vector
theta = np.zeros((n + 1, 1))          # initializing theta with all zeros
J = costFunction(theta, X, y)
print(J)
This should give us a value of 0.693 for J. (That makes sense: with theta initialized to all zeros, the sigmoid outputs 0.5 for every training example, and -log(0.5) ≈ 0.693.)
Learning parameters using fmin_tnc
In the previous assignment, we found the optimal parameters of a linear regression model by implementing the gradient descent algorithm ourselves: we wrote a cost function, calculated its gradient, and then took gradient descent steps accordingly. This time, instead of taking the gradient descent steps, we will use the built-in function fmin_tnc from the scipy library.
fmin_tnc is an optimization solver that finds the minimum of an unconstrained function. For logistic regression, you want to optimize the cost function with the parameters theta.
Constraints in optimization often refer to constraints on the parameters, for example constraints that bound the possible values theta can take (e.g., theta ≤ 1). Logistic regression does not have such constraints, since theta is allowed to take any real value.
Concretely, you are going to use fmin_tnc to find the best, or optimal, parameters theta for the logistic regression cost function, given a fixed dataset (of X and y values). You will pass to fmin_tnc the following inputs:
- The initial values of the parameters we are trying to optimize.
- A function that, when given the training set and a particular theta, computes the logistic regression cost and gradient with respect to theta for the dataset (X, y).
temp = opt.fmin_tnc(func = costFunction,
                    x0 = theta.flatten(),
                    fprime = gradient,
                    args = (X, y.flatten()))
# the output of the above function is a tuple whose first element
# contains the optimized values of theta
theta_optimized = temp[0]
print(theta_optimized)
Note on the flatten() function: unfortunately, scipy’s fmin_tnc doesn’t work well with column or row vectors; it expects the parameters as a flat array. The flatten() function reduces a column or row vector into that flat array format.
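A tiny illustration of what flatten() does to our (n+1, 1) column vector of parameters:

theta_col = np.zeros((3, 1))
print(theta_col.shape)             # (3, 1) -- a column vector
print(theta_col.flatten().shape)   # (3,)   -- the flat array that fmin_tnc expects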
The call to fmin_tnc above should give [-25.16131862, 0.20623159, 0.20147149].
If you have completed the costFunction correctly, fmin_tnc will converge on the right optimization parameters and return the final values of theta. Notice that by using fmin_tnc, you did not have to write any loops yourself or set a learning rate like you did for gradient descent. This is all done by fmin_tnc :-) You only needed to provide functions for calculating the cost and the gradient.
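As an aside, the same optimization can also be run through scipy’s newer scipy.optimize.minimize interface with the TNC method; the snippet below is just a sketch that reuses the costFunction and gradient defined above and should land on essentially the same theta:

res = opt.minimize(fun = costFunction,
                   x0 = theta.flatten(),
                   args = (X, y.flatten()),
                   method = 'TNC',
                   jac = gradient)
print(res.x)   # should closely match the result from fmin_tnc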
Let’s use these optimized theta values to calculate the cost.
J = costFunction(theta_optimized[:,np.newaxis], X, y)
print(J)
You should see a value of 0.203. Compare this with the cost of 0.693 obtained using the initial theta.
Plotting Decision Boundary (Optional)
This final theta value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the one below.
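The boundary itself comes from setting the hypothesis to 0.5, i.e. \theta^{T} x = 0, and solving for the Exam 2 score:

\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \;\Longrightarrow\; x_2 = -\frac{1}{\theta_2}\big(\theta_0 + \theta_1 x_1\big)

which is exactly what plot_y computes in the code below.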
plot_x = np.array([np.min(X[:, 1]) - 2, np.max(X[:, 1]) + 2])   # range of Exam 1 scores
plot_y = -1 / theta_optimized[2] * (theta_optimized[0]
                                    + theta_optimized[1] * plot_x)
mask = y.flatten() == 1
adm = plt.scatter(X[mask][:, 1], X[mask][:, 2])
not_adm = plt.scatter(X[~mask][:, 1], X[~mask][:, 2])
decision_boun = plt.plot(plot_x, plot_y)
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend((adm, not_adm), ('Admitted', 'Not admitted'))
plt.show()
It looks like our model does a pretty good job of distinguishing the students who got admission from those who didn’t. Now let’s quantify our model’s accuracy, for which we will write a function aptly named accuracy.
def accuracy(X, y, theta, cutoff):
    pred = sigmoid(X @ theta) >= cutoff   # predicted labels at the given threshold
    acc = np.mean(pred == y)
    print(acc * 100)

accuracy(X, y.flatten(), theta_optimized, 0.5)
This should give us an accuracy score of 89%. Hmm… not bad.
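As one more sanity check, borrowed from the original exercise write-up, you can predict the admission probability for a single student, say one who scored 45 on Exam 1 and 85 on Exam 2 (the exercise quotes a probability of about 0.776):

prob = sigmoid(np.array([1, 45, 85]) @ theta_optimized)
print(prob)   # roughly 0.776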
You have now learnt how to perform Logistic Regression. Well done!
That’s it for this post. Give me a clap (or several claps) if you liked my work.
You can find the next post in this series here.