‘Machine Learning’ course: Recoding with Python — Part5: One-vs-All Classification & Neural Network

su_sandy
7 min read · Jun 19, 2022


Image Ref: Python language vector created by svstudioart — www.freepik.com (https://www.freepik.com/vectors/python-language)

This is the 5th article in this series, where I try to recode the exercises in the (old) Machine Learning course by Andrew Ng (where the programming exercises are done using Octave). My intention in writing these articles is to help learners of this course use Python as an alternative while doing the exercises. Feel free to explore the previous parts in this series as well:
Part1: Linear Regression model with one feature
Part2: Linear Regression with multiple features
Part3: (Unregularized) Logistic Regression model
Part4: Regularized Logistic Regression

For this exercise, we are given a dataset containing 5000 training examples of handwritten digits, where each training example is a 20-pixel by 20-pixel grayscale image of a digit. Each pixel is represented by a floating-point number indicating the grayscale intensity at that location. So, our training dataset is a 5000 by 400 matrix. The second part of the training set is a 5000-dimensional vector y that contains the labels for the training set. There are a total of 10 classes (‘1’, ‘2’, ‘3’, …, ‘10’). Please note that the digit ‘0’ is labeled as ‘10’, while the digits ‘1’ to ‘9’ are labeled as ‘1’ to ‘9’ in their natural order.

1. Getting to know the data

Let’s first take a look at the dataset given.

# load the data
import math
import scipy.io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = scipy.io.loadmat('ex3data1.mat')
data
‘X’ has all the feature columns. ‘y’ has the target column. Both contain 5000 training examples.
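Before plotting anything, a quick shape check (a minimal sketch using the keys loaded from ex3data1.mat) confirms what we expect:

# quick sanity check on the loaded arrays
print(data['X'].shape)        # (5000, 400): one 20x20 image per row
print(data['y'].shape)        # (5000, 1): one label per training example
print(np.unique(data['y']))   # labels 1..10 ('10' stands for the digit 0)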

To visualize the data, I will create a function called displayData() which takes two parameters: the training dataset and the number of subplots (along one axis). For now, you can ignore the first branch of the ‘if’ statement; it is used to display a single predicted result in the last part of the article.

def displayData(X, subplot):
    width = int(round(math.sqrt(X.shape[1])))
    m, n = X.shape
    height = int(n / width)

    # create subplots
    fig, axarr = plt.subplots(subplot, subplot, figsize=(6, 6),
                              gridspec_kw={'wspace': 0, 'hspace': 0})

    # this part is for visualizing the training example and its prediction in the last part
    if subplot == 1:
        pixels = X
        pixels = pixels.reshape(width, height)
        axarr.imshow(pixels.T, cmap='gray_r')
        axarr.set_xticks([])  # remove the ticks
        axarr.set_yticks([])

    # this part is for showing random digits from X
    else:
        for i in range(subplot):
            for j in range(subplot):
                random_index = np.random.choice(len(X))
                pixels = X[random_index]
                pixels = pixels.reshape(width, height)
                axarr[i, j].imshow(pixels.T, cmap='gray_r')
                axarr[i, j].set_xticks([])  # remove the ticks
                axarr[i, j].set_yticks([])
    plt.show()

X = data['X']
y = data['y']
displayData(X, 10)

This will randomly select 100 rows from X and display them in a figure with 100 subplots as below:

2. Cost Function and Gradient (with Regularization)

The dataset contains more than two labels, and therefore, we will be using multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes (labels ‘1’ to ‘10’), we will train 10 separate logistic regression classifiers. Before implementing the multi-class classifier, let’s first define the necessary functions: the sigmoid function, Cost function, and gradient function.

Regularized Cost Function

Here, for both the cost function and the gradient function, don’t forget to add a column of 1s to X for the bias term. While computing the gradient of the cost, take note that we do not regularize the theta value for the bias term. Please refer to Part4 for a more detailed walkthrough of both functions.
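For reference, the regularized cost function and its gradient (the same equations used in Part4) are:

J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right)-\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

\frac{\partial J(\theta)}{\partial \theta_0}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}

\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j \qquad (j \ge 1)

where h_\theta(x) = g(\theta^T x) and g(z) = \frac{1}{1+e^{-z}} is the sigmoid function.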

# the sigmoid function
def sigmoid(z):
    g = 1/(1 + np.exp(-z))
    return g

# this function computes the cost of using theta as the parameter for regularized logistic regression
def J(theta, x, y, lambdaa):
    m = len(y)
    theta = theta.reshape((theta.shape[0], 1))

    x = np.concatenate((np.ones((len(x), 1)), x), axis=1)  # add ones to X: (5000x401) matrix

    y_prime = np.transpose(y)
    h_theta = sigmoid(np.dot(x, theta))

    term_1 = np.dot(y_prime, np.log(h_theta))
    term_2 = np.dot(np.transpose(1 - y), np.log(1 - h_theta))
    term_3 = 0.5 * lambdaa * np.sum(np.power(theta[1:], 2))  # do not regularize the bias term
    J = (-term_1 - term_2 + term_3) / m
    J = np.sum(J)
    return J

Gradient of the Cost (Regularized)

def Gradient(theta, x, y, lambdaa):
    m = len(y)
    theta = theta.reshape((theta.shape[0], 1))

    x = np.concatenate((np.ones((len(x), 1)), x), axis=1)  # add ones to X: (5000x401) matrix

    j = np.ones((x.shape[1], 1))
    j[0] = 0  # we will not regularize for j = 0 (the bias term)
    grad = (np.dot(np.transpose(x), sigmoid(np.dot(x, theta)) - y) + lambdaa * np.multiply(j, theta)) / m
    return grad
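As a quick, optional sanity check (a small sketch): with theta set to all zeros the hypothesis is sigmoid(0) = 0.5 for every example, so the unregularized cost on any binary label vector should come out close to -log(0.5) ≈ 0.693.

# optional sanity check for J() and Gradient()
theta_zero = np.zeros((X.shape[1] + 1, 1))
y_binary = (y == 1).astype(int)                    # one-vs-all labels for class '1'
print(J(theta_zero, X, y_binary, 0))               # expected: ~0.693
print(Gradient(theta_zero, X, y_binary, 0).shape)  # (401, 1)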

3. One-vs-All Classification

We will implement one-vs-all classification by training 10 regularized logistic regression classifiers, one for each of the K classes (in our dataset, K = 10).
All the classifier parameters will be returned in a matrix as below where each row corresponds to the learned logistic regression parameters for one class.

For our case, the function below will return a (10, 401) matrix where each row holds the theta values for one label (e.g., the row with index 0 holds the theta values for label ‘1’, index 1 for label ‘2’, …, index 9 for label ‘10’).
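To make the label mapping concrete, here is a small sketch of the binary target vector that is built when training the classifier for a single class (k = 3 here is just an arbitrary example):

# illustration: binary labels used when training the classifier for class k
k = 3
y_k = (y == k).astype(int)   # 1 where the example is digit '3', 0 otherwise
print(y_k.sum())             # number of positive examples for this class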

import scipy.optimize as opt

def oneVsall(X, y, labels, lambdaa):
    m = X.shape[0]  # no. of training examples: 5000
    n = X.shape[1]  # no. of features: 400
    all_theta = np.zeros((labels, n+1))

    for i in range(1, labels+1):
        y_ = (y == i).astype(int)
        theta_start = np.zeros((n+1, 1))

        theta_ = opt.fmin_tnc(func=J, x0=theta_start, fprime=Gradient, args=(X, y_, lambdaa))  # -> result

        all_theta[i-1, :] = theta_[0]
    return all_theta

labels = 10
lambdaa = 0.1
result = oneVsall(X, y, labels, lambdaa)
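Before moving on, a quick check of the returned parameter matrix (a small sketch):

print(result.shape)   # (10, 401): row k holds the learned parameters for label k+1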

4. Prediction

We can now use our one-vs-all classifier to predict the digit contained in a given image. For each input, we will compute the ‘probability’ that it belongs to each class using the trained classifiers. The one-vs-all prediction function will pick the class for which the corresponding logistic regression classifier outputs the highest probability and return the class label (1,2,…,10) as the prediction.

# the labels are in the range (1 to K) where K = 10
def predict(all_theta, x):
    labels = all_theta.shape[0]

    h_theta = sigmoid(np.dot(x, np.transpose(all_theta)))
    hmax = np.amax(h_theta, axis=1)                 # highest probability per example
    prediction = np.argmax(h_theta, axis=1) + 1     # class with the highest probability
    prediction = prediction.reshape((prediction.shape[0], 1))  # just reshaping into a similar shape as y

    return prediction

X_ = np.concatenate((np.ones((len(X), 1)), X), axis=1)  # add ones to X: (5000x401) matrix
pred = predict(result, X_)
print("Training set accuracy : {}%".format(np.mean((pred == y).astype(float))*100))

The resulting ‘pred’ is a (5000, 1) matrix containing the predicted labels for all 5000 training examples. The training set accuracy obtained is 96.46%.
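If you want to dig a little deeper, a small sketch like the one below breaks the accuracy down per class (the exact numbers will depend on your training run):

# optional: per-class accuracy breakdown
for k in range(1, 11):
    mask = (y == k).ravel()
    print("label {}: {:.2f}%".format(k, np.mean(pred[mask] == y[mask]) * 100))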

5. Multi-class Logistic Regression vs. Neural Network

We have implemented multi-class logistic regression to recognize handwritten digits. However, logistic regression cannot form more complex hypotheses, as it is only a linear classifier (we could add more features, such as polynomial features, to logistic regression, but that can be very expensive to train). So, we will now implement a neural network to recognize handwritten digits using the same training set as before. The neural network will be able to represent complex models that form non-linear hypotheses.

For this part of the exercise, we are given the weights (already trained) and we just need to implement the feedforward propagation algorithm for prediction.

Model Representation

Our neural network has 3 layers — an input layer, a hidden layer, and an output layer. We have 400 units in our input layer (excluding the extra bias unit which always outputs +1). We have a set of trained parameters for ϴ(1) and ϴ(2) (which are stored in ex3weights.mat). The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 units in the output layer (corresponding to 10 classes).
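In equation form (using the course notation), the feedforward pass that predict2() below implements is:

a^{(1)} = x \quad (\text{with the bias unit } +1 \text{ added})

z^{(2)} = \Theta^{(1)} a^{(1)}, \qquad a^{(2)} = g(z^{(2)}) \quad (\text{then add the bias unit})

z^{(3)} = \Theta^{(2)} a^{(2)}, \qquad h_\Theta(x) = a^{(3)} = g(z^{(3)})

where g is the same sigmoid function defined earlier.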

weights = scipy.io.loadmat('ex3weights.mat')
Theta1 = weights['Theta1']
Theta2 = weights['Theta2']

Here, Theta1 is a (25,401) matrix and Theta2 is a (10,26) matrix.

5.1 Feedforward Propagation and Prediction

# This function will predict the label of an input given a trained neural network
def predict2(theta1, theta2, x):
    x = np.concatenate((np.ones((len(x), 1)), x), axis=1)  # add ones to X: (5000x401) matrix

    m = x.shape[0]
    labels = theta2.shape[0]

    a1 = x  # we have already added the bias column to x

    z2 = np.dot(a1, np.transpose(theta1))
    a2 = sigmoid(z2)
    a2 = np.concatenate((np.ones((m, 1)), a2), axis=1)  # add the bias unit to the hidden layer
    z3 = np.dot(a2, np.transpose(theta2))
    a3 = sigmoid(z3)

    hmax = np.amax(a3, axis=1)                  # highest activation per example
    prediction = np.argmax(a3, axis=1) + 1      # class with the highest activation
    prediction = prediction.reshape((prediction.shape[0], 1))  # just reshaping into a similar shape as y

    return prediction

pred = predict2(Theta1, Theta2, X)
print("Training set accuracy : {}%".format(np.mean((pred == y).astype(float))*100))

The training set accuracy obtained from this neural network is 97.52% (which is a little bit higher than our previous one-vs-all classifiers).
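As an optional extra, you can also check how often the two models agree on the training set (a small sketch, reusing the variables defined above):

# optional: compare the two models' predictions example by example
pred_lr = predict(result, X_)            # one-vs-all logistic regression predictions
pred_nn = predict2(Theta1, Theta2, X)    # neural-network predictions
print("Agreement: {:.2f}%".format(np.mean(pred_lr == pred_nn) * 100))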

We can use the below code to display the image of the training set (one at a time) together with its predicted label.

# to display images from the training set one at a time
i = np.random.randint(5000)
train_eg = X[i, :].reshape((1, X[i, :].shape[0]))
pred = predict2(Theta1, Theta2, train_eg)
print("Neural Network Prediction : {} (digit {})\n".format(pred, pred % 10))
displayData(train_eg, 1)
Here, the predicted label was ‘10’, which represents the digit ‘0’ (hence the pred % 10 in the print statement), and the image of that particular training example is displayed.

This is all for this part. In the upcoming Part6, we will train our own parameters using the cost function and the gradients obtained from the neural network backpropagation (using the same dataset).

Keep Learning. Enjoy the journey!
