A Guide to using Logistic Regression for Digit Recognition (with Python codes)

Published in Analytics Vidhya, Sep 20, 2018

The first classification technique most aspiring data scientists come across is logistic regression. Many banking services still use it, despite the rise of more powerful methods such as random forests. That is not surprising, given how straightforward and interpretable logistic regression is.

But could you imagine using it for a computer vision task?

In this post, we will learn how a simple algorithm like Logistic Regression can be used to recognise handwritten digits (0–9) using a technique called one-vs-all classification. Along the way, we will also learn about vectorization and its benefits.

This blog post is inspired by problem set 3 of Andrew Ng’s Machine Learning course (the dataset can be obtained here).

If you’re new to this field, make sure to go through my earlier posts as well:

Single and Multivariate Linear Regression (part 1)
Logistic Regression or Classification (part 2.1)
Regularized Logistic Regression (part 2.2)

Understanding One-vs-all Classification

If there are K different classes in a dataset, we first build a model that treats the data belonging to one class as positive and the data from all other classes as negative. Next, we build another model that treats some other class as positive and the rest as negative. We keep repeating the process until we have built K different models, one per class.

Let us understand this better with an example. In the figure below, we have data belonging to 3 different classes. Hence we build 3 different models, each treating one particular class as positive and the remaining two as negative.

In general, if there are K classes in the dataset, we need to build K different models.

Borrowed from Andrew Ng Machine Learning course (Coursera)
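To make the relabeling concrete, here is a minimal sketch of how one multi-class label vector turns into K binary target vectors. The `labels` array is a made-up toy example, not part of the dataset used in this post.

```python
import numpy as np

# Hypothetical toy labels for a 3-class problem (not from the dataset).
labels = np.array([1, 2, 3, 1, 3, 2])
classes = np.unique(labels)

# One binary target vector per class: 1 where the example belongs
# to that class, 0 everywhere else.
binary_targets = {int(c): (labels == c).astype(int) for c in classes}

for c, target in binary_targets.items():
    print(c, target)
```

Each example is positive in exactly one of the K target vectors, which is why K separate binary classifiers are enough to cover every class.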

One-vs-all using Logistic Regression

The dataset consists of digits from 0 to 9, so we have 10 different classes here. We will use the one-vs-all classification technique described above, training 10 different logistic regression classifiers, one per digit.

First, let’s load the necessary libraries.

from scipy.io import loadmat
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt

Reading the data

data = loadmat('ex3data1.mat')
X = data['X']
y = data['y']

The dataset has 5000 training examples, each a 20-by-20 pixel grayscale image unrolled into a 400-dimensional vector, so X is a 5000-by-400 matrix. Also note that the digit 0 is labeled as 10, while the digits 1–9 are labeled as 1–9 in the label vector y.
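The two conventions above (unrolled images and the label 10 standing for the digit 0) can be sketched on stand-in data. The arrays `X_demo` and `y_demo` below are hypothetical placeholders with the same layout as `X` and `y`, and the column-major reshape simply follows the `order='F'` used in this post's plotting code.

```python
import numpy as np

# Hypothetical stand-in for the real data: 3 fake "images" and their labels.
rng = np.random.default_rng(0)
X_demo = rng.random((3, 400))
y_demo = np.array([[10], [1], [9]])  # column vector, like data['y']

# Recover a 20x20 image from one unrolled row (column-major order).
image = X_demo[0].reshape((20, 20), order='F')

# Label 10 stands for the digit 0; the modulo maps 10 -> 0 and leaves 1-9 alone.
digits = y_demo.flatten() % 10
print(image.shape, digits)
```

The rest of the post keeps the original labels (10 for the digit 0), so this remapping is only useful when you want to display the actual digit.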

Visualizing the data

_, axarr = plt.subplots(10, 10, figsize=(10, 10))
for i in range(10):
    for j in range(10):
        # pick a random example and restore its 20x20 shape (column-major)
        img = X[np.random.randint(X.shape[0])].reshape((20, 20), order='F')
        axarr[i, j].imshow(img)
        axarr[i, j].axis('off')
plt.show()

Adding the intercept term

m = len(y)
ones = np.ones((m,1))
X = np.hstack((ones, X)) #add the intercept
(m,n) = X.shape

Vectorization

According to Andrew Ng, “Vectorization is the art of getting rid of explicit for-loops in code”. As data scientists, we work with huge amounts of data, and looping over it element by element in Python is highly inefficient. Vectorization replaces those explicit loops with array operations, which makes the calculations much faster.

For example, let’s consider two 1-D arrays, a and b, with a million elements each. To compare the speed of vectorized code and for-loops, we multiply the two arrays element-wise, sum the elements of the result, and time both approaches.

import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

c = 0
tic = time.time()
for i in range(1000000):
  c += a[i] * b[i]
toc = time.time()
print("value of c {0:.5f}".format(c))
print("time taken using for-loop " + str(1000*(toc-tic)) + " ms")

c = 0
tic = time.time()
c = np.dot(a,b) # no for-loops in vectorized version
toc = time.time()
print("value of c {0:.5f}".format(c))
print("time taken using vectorized operation " + str(1000*(toc-tic)) + " ms")

value of c 249740.84172
time taken using for-loop 431.77247047424316 ms
value of c 249740.84172
time taken using vectorized operation 1.9989013671875 ms

As the output above shows, the vectorized version is roughly 200 times faster than the for-loop in this case (about 432 ms versus 2 ms).

Vectorizing Logistic Regression

Using a vectorized version of Logistic Regression is much more efficient than using for-loops, particularly when the dataset is large. In this exercise, we are going to avoid for-loops by implementing vectorized Logistic Regression.

Since Logistic Regression uses the sigmoid function, we will implement this first:

def sigmoid(z):
    return 1/(1+np.exp(-z))

Vectorized Cost Function:

def costFunctionReg(theta, X, y, lmbda):
    m = len(y)
    temp1 = np.multiply(y, np.log(sigmoid(np.dot(X, theta))))
    temp2 = np.multiply(1-y, np.log(1-sigmoid(np.dot(X, theta))))
    return np.sum(temp1 + temp2) / (-m) + np.sum(theta[1:]**2) * lmbda / (2*m)
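For reference, the function above appears to compute the usual regularized cross-entropy cost. Written out under that reading, with h_θ(x) = sigmoid(θᵀx), m the number of examples, n the number of columns of X including the intercept (as in the code), and the intercept parameter θ_0 left out of the penalty:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)} \log h_\theta\big(x^{(i)}\big)
          + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta\big(x^{(i)}\big)\big) \Big]
          + \frac{\lambda}{2m}\sum_{j=1}^{n-1}\theta_j^{2}
```

The first sum corresponds to `temp1 + temp2` in the code, and the second to `np.sum(theta[1:]**2) * lmbda / (2*m)`.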

Vectorized gradient:

def gradRegularization(theta, X, y, lmbda):
    m = len(y)
    temp = sigmoid(np.dot(X, theta)) - y
    temp = np.dot(temp.T, X).T / m + theta * lmbda / m
    temp[0] = temp[0] - theta[0] * lmbda / m
    return temp

As you can see from the above, we have avoided the use of for-loops and also added the regularization term to take care of over-fitting.
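One way to gain confidence that a hand-written gradient matches its cost function is a finite-difference check. The sketch below uses `scipy.optimize.check_grad` on a small random problem. The two functions are restated in compact form so the block runs on its own, and the toy data (`X_toy`, `y_toy`, `theta_toy`) is made up purely for the check.

```python
import numpy as np
from scipy.optimize import check_grad

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def costFunctionReg(theta, X, y, lmbda):
    m = len(y)
    h = sigmoid(np.dot(X, theta))
    cost = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    return cost + np.sum(theta[1:] ** 2) * lmbda / (2 * m)

def gradRegularization(theta, X, y, lmbda):
    m = len(y)
    grad = np.dot(sigmoid(np.dot(X, theta)) - y, X) / m + theta * lmbda / m
    grad[0] -= theta[0] * lmbda / m  # the intercept is not regularized
    return grad

# Hypothetical toy problem: 50 examples, an intercept column plus 4 features.
rng = np.random.default_rng(0)
X_toy = np.hstack((np.ones((50, 1)), rng.normal(size=(50, 4))))
y_toy = (rng.random(50) > 0.5).astype(float)
theta_toy = rng.normal(scale=0.1, size=5)

# check_grad returns the norm of the difference between the analytic
# gradient and a finite-difference approximation of it.
err = check_grad(costFunctionReg, gradRegularization, theta_toy, X_toy, y_toy, 0.1)
print(err)
```

A value close to zero suggests the analytic gradient agrees with the cost function; a large value usually points to a sign or indexing mistake.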

Optimizing Parameters

Here we will use fmin_cg, an advanced numerical optimization function from the scipy library, to find the optimal values for our parameters.

lmbda = 0.1
k = 10
theta = np.zeros((k,n)) # initial parameters

for i in range(k):
    digit_class = i if i else 10
    theta[i] = opt.fmin_cg(f = costFunctionReg, x0 = theta[i],  fprime = gradRegularization, args = (X, (y == digit_class).flatten(), lmbda), maxiter = 50)

Since we have 10 different models, we needed to find the optimal parameters for each model by using a for-loop.

Making Predictions using the One-vs-all Technique

After training the one-vs-all classifier, we can now use it to predict the digit contained in a given image. For each input, we compute the “probability” that it belongs to each class using the trained logistic regression classifiers. We pick the class whose classifier outputs the highest probability and return its label (1, 2, …, or K) as the prediction for the input example. We then compare the returned prediction vector with the true labels to find the model accuracy.
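The prediction step can be sketched on a tiny hand-made example. The parameters in `theta_toy` and the inputs in `X_toy` are invented for illustration and have nothing to do with the trained digit classifiers. The sketch also shows why taking the argmax of the raw scores is enough: the sigmoid is strictly increasing, so it does not change which classifier scores highest.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters for 3 classifiers over [intercept, x1, x2].
# Row 0 plays the role of the "digit 0" classifier (label 10 in this post).
theta_toy = np.array([[ 0.0,  2.0, -1.0],
                      [ 0.5, -1.0,  1.0],
                      [-0.5,  0.0,  3.0]])
X_toy = np.array([[1.0,  1.0, 0.0],
                  [1.0, -1.0, 0.5],
                  [1.0,  0.0, 2.0]])

scores = X_toy @ theta_toy.T  # one column of raw scores per classifier
probs = sigmoid(scores)       # the corresponding "probabilities"

by_score = np.argmax(scores, axis=1)
by_prob = np.argmax(probs, axis=1)

# Map row index 0 back to label 10, following the dataset's convention.
labels = np.where(by_score == 0, 10, by_score)
print(by_score, by_prob, labels)
```

Both argmax calls pick the same classifier for every input here, and the final line converts classifier indices back into the dataset's labels.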

pred = np.argmax(X @ theta.T, axis = 1)
pred = [e if e else 10 for e in pred]
np.mean(pred == y.flatten()) * 100

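This should give us an accuracy of 95.08%. Impressive! Note that this is accuracy on the training set, since we are predicting on the same 5000 examples the classifiers were trained on. Even so, our model has done a very good job at predicting the digits.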

End Notes

We have learned how a simple algorithm like Logistic Regression can be used to perform complex tasks like digit recognition, and in the process also got to know about the one-vs-all technique and vectorization.

Thanks for making it this far. If you liked my work, give me a clap (or several claps). In the next post we will learn about Neural Networks. Stay tuned!