Python Implementation of Andrew Ng’s Machine Learning Course (Part 2.2)

Srikar · Published in Analytics Vidhya · 6 min read · Sep 8, 2018

This has so far been an incredibly popular series and I’m grateful to all of you for reading it. Check the previous articles below (in case you haven’t covered them yet):

Continuing our journey of the Pythonic version of Andrew Ng’s course, in this blog post we’ll learn about Regularized Logistic Regression.

Pre-requisites

It’s highly recommended that you first watch the week 3 video lectures and complete the in-video quizzes.

You should have basic familiarity with the Python ecosystem.

Regularized logistic regression

Problem context

You will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly.

Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.

First let’s load the necessary libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt # more on this later

Next, we read the data (the necessary file, ex2data2.txt, is available under the week-3 content).

data = pd.read_csv('ex2data2.txt', header = None)  # the file has no header row
X = data.iloc[:, :-1]   # first two columns: scores on the two QA tests
y = data.iloc[:, 2]     # third column: 1 = accepted, 0 = rejected
data.head()

So we have two independent features and one dependent variable. Here 0 means the chip has been rejected and 1 means accepted.

Visualizing the data

Before starting to implement any learning algorithm, it is always good to visualize the data if possible.

mask = y == 1   # boolean mask: True for accepted chips, False for rejected ones
passed = plt.scatter(X[mask][0].values, X[mask][1].values)
failed = plt.scatter(X[~mask][0].values, X[~mask][1].values)
plt.xlabel('Microchip Test1')
plt.ylabel('Microchip Test2')
plt.legend((passed, failed), ('Passed', 'Failed'))
plt.show()

The figure above shows that our dataset cannot be separated into positive and negative examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression is only able to find a linear decision boundary.

Feature mapping

One way to fit the data better is to create more features from each data point. Hence we will map the features into all polynomial terms of x1 and x2 up to the sixth power.

As a result of this mapping, our vector of two features (the scores on the two QA tests) is transformed into a 28-dimensional vector: counting the bias term, there are 1 + 2 + … + 7 = 28 polynomial terms of degree six or less in two variables. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2-dimensional plot.

While the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. In the next parts of the exercise, you will implement regularized logistic regression to fit the data and also see for yourself how regularization can help combat the overfitting problem.

def mapFeature(X1, X2):
    degree = 6
    X1, X2 = np.asarray(X1), np.asarray(X2)    # accept pandas Series or numpy arrays
    out = np.ones(X1.shape[0])[:, np.newaxis]  # bias (intercept) term
    for i in range(1, degree + 1):
        for j in range(i + 1):
            # append the term x1^(i-j) * x2^j as a new column
            out = np.hstack((out, np.multiply(np.power(X1, i - j), np.power(X2, j))[:, np.newaxis]))
    return out
X = mapFeature(X.iloc[:, 0], X.iloc[:, 1])
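As a quick sanity check (this print statement is my addition, not part of the exercise code), the mapped matrix should now have 28 columns: the bias term plus the 27 polynomial terms.

print(X.shape)  # expect (m, 28)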

Implementation

Before you start with the actual cost function, recall that the logistic regression hypothesis makes use of the sigmoid function. Let’s define our sigmoid function.

Sigmoid Function

def sigmoid(x):
    return 1/(1+np.exp(-x))
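A quick check (my addition) that the function behaves as expected: sigmoid(0) should be exactly 0.5, and large positive or negative inputs should saturate towards 1 and 0.

print(sigmoid(0))                  # 0.5
print(sigmoid(10), sigmoid(-10))   # approximately 1 and 0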

Cost Function

As usual, let’s code our cost function and gradient function.
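For reference, the regularized cost we are implementing is the usual logistic regression cost plus a penalty on the parameters (note that the bias parameter θ₀ is not regularized):

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2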

def lrCostFunction(theta_t, X_t, y_t, lambda_t):
    m = len(y_t)
    # unregularized logistic regression cost
    J = (-1/m) * (y_t.T @ np.log(sigmoid(X_t @ theta_t)) + (1 - y_t.T) @ np.log(1 - sigmoid(X_t @ theta_t)))
    # regularization term; theta_0 (the bias term) is not regularized
    reg = (lambda_t/(2*m)) * (theta_t[1:].T @ theta_t[1:])
    J = J + reg
    return J

There are multiple ways to code the cost function. What’s more important are the underlying mathematical ideas and our ability to translate them into code.

Gradient Function
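The gradient of the regularized cost is the ordinary logistic regression gradient, with an extra (λ/m)·θⱼ term for every parameter except θ₀:

\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}, \qquad \frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad (j \ge 1)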

def lrGradientDescent(theta, X, y, lambda_t):
    m = len(y)
    # unregularized gradient
    grad = (1/m) * X.T @ (sigmoid(X @ theta) - y)
    # add the regularization term for every parameter except theta_0
    grad[1:] = grad[1:] + (lambda_t / m) * theta[1:]
    return grad

Let’s call these functions using the initial parameters.

(m, n) = X.shape
y = y.values[:, np.newaxis]   # convert the pandas Series to an (m, 1) column vector
theta = np.zeros((n,1))
lmbda = 1
J = lrCostFunction(theta, X, y, lmbda)
print(J)

This gives us a value of 0.69314718, which is what we should expect: with all parameters initialized to zero, the hypothesis is 0.5 for every example, so the cost is −log(0.5) ≈ 0.693.

Learning parameters using fmin_tnc

As in the previous post, we will make use of fmin_tnc.

fmin_tnc is an optimization solver that finds the minimum of an unconstrained function. For logistic regression, we want to minimize the cost function with respect to the parameters theta.

output = opt.fmin_tnc(func=lrCostFunction, x0=theta.flatten(), fprime=lrGradientDescent,
                      args=(X, y.flatten(), lmbda))
theta = output[0]
print(theta)  # theta contains the optimized values

Note on the flatten() function: unfortunately, scipy’s fmin_tnc doesn’t work well with column or row vectors; it expects the parameters as a flat 1-D array. The flatten() function collapses a column or row vector into that flat array format.
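If you prefer, the more general scipy.optimize.minimize interface can be used in place of fmin_tnc; this is an alternative I’m adding for illustration, not part of the original exercise. With method='TNC' it runs the same solver and returns the optimized parameters in res.x:

res = opt.minimize(fun=lrCostFunction, x0=np.zeros(X.shape[1]), args=(X, y.flatten(), lmbda),
                   method='TNC', jac=lrGradientDescent)
theta_alt = res.x  # should match output[0] above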

Accuracy of model

Let’s find the model’s accuracy by predicting the outcomes from our learned parameters and comparing them with the actual outcomes.

pred = sigmoid(X @ theta) >= 0.5   # predict 1 (accepted) when the hypothesis is at least 0.5
np.mean(pred == y.flatten()) * 100

This gives our model an accuracy of 83.05% (measured on the training set).

Plotting Decision Boundary (optional)

To help you visualize the model learned by this classifier, we will plot the (non-linear) decision boundary that separates the positive and negative examples. We do this by computing the classifier’s predictions on an evenly spaced grid and then drawing a contour plot of where the predictions change from y = 0 to y = 1.

u = np.linspace(-1, 1.5, 50)
v = np.linspace(-1, 1.5, 50)
z = np.zeros((len(u), len(v)))

# same polynomial mapping as before, but for a single point at a time
def mapFeatureForPlotting(X1, X2):
    degree = 6
    out = np.ones(1)
    for i in range(1, degree + 1):
        for j in range(i + 1):
            out = np.hstack((out, np.multiply(np.power(X1, i - j), np.power(X2, j))))
    return out

# evaluate theta' * x over the grid; the decision boundary is where this equals 0
for i in range(len(u)):
    for j in range(len(v)):
        z[i, j] = np.dot(mapFeatureForPlotting(u[i], v[j]), theta)

mask = y.flatten() == 1
X = data.iloc[:, :-1]
passed = plt.scatter(X[mask][0], X[mask][1])
failed = plt.scatter(X[~mask][0], X[~mask][1])
# z is indexed as z[i, j] = f(u[i], v[j]), so transpose it before contouring
plt.contour(u, v, z.T, levels=[0])
plt.xlabel('Microchip Test1')
plt.ylabel('Microchip Test2')
plt.legend((passed, failed), ('Passed', 'Failed'))
plt.show()

Our model has done a pretty good job of classifying the various data points.

Also try changing the value of lambda to see for yourself how the decision boundary changes; a quick sketch of how to do that is below.
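As a minimal sketch of that experiment (the lambda values here are just examples), you can re-map the features, retrain for a few settings, and compare the fitted costs, or re-run the plotting code above for each setting:

X_poly = mapFeature(X.iloc[:, 0], X.iloc[:, 1])  # X was reset to the raw two features above
for lmbda in [0, 1, 100]:
    out = opt.fmin_tnc(func=lrCostFunction, x0=np.zeros(X_poly.shape[1]),
                       fprime=lrGradientDescent, args=(X_poly, y.flatten(), lmbda))
    # smaller lambda fits the training set more tightly; very large lambda underfits
    print(lmbda, lrCostFunction(out[0], X_poly, y.flatten(), lmbda))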

Thanks for making it this far. If you liked my work, give me a clap (or several claps).

The next article in this series is going to be really interesting because we will build a model that recognizes hand-written digits.
