Recognition of hand-written numbers using Logistic Regression

Aiswarya M
5 min read · Aug 28, 2020


AI Workshop — Part II

In the second part of the workshop, we learnt about Classification using Logistic Regression, which also comes under Supervised Machine Learning. The first part of the workshop focussed on Linear Regression.

Both regression and classification predict an output. Regression predicts the output in a continuous numerical range, while classification predicts the output in a discrete range. For example, predicting the temperature at a place based on weather report data is a regression problem, whereas predicting whether the day’s weather falls under the ‘Sunny’, ‘Cold’, ‘Windy’ or ‘Rainy’ category is a classification problem.

Here, we learnt to apply logistic regression to recognise a hand-written numerical character (of pixel size 28X28) and classify it as the corresponding number (1, 2, 3 or 4).

1. Reading the data

m data samples were used for the prediction model, where each data sample is an image of pixel size 28X28 and each image is a hand-written number in white (pixel value = 255) on a black background (pixel value = 0). Here, m = 3599. The array for each image comprises the number written on the image followed by the pixel values of that image, i.e., it has 784+1 elements (the 1st element is the number written on the image, followed by the 28X28 = 784 pixel values). The raw data is therefore a matrix (3599 rows, 785 columns) read from a CSV file.

2. Separating input (X) and output (Y)

The first column was extracted as the expected output (Y) and the rest of the array elements (corresponding pixel values) were extracted as the input (X) used to train and test this prediction model.

Thus,

size(X) = (3599, 784)

size(Y) = (3599, 1)

In this classification problem, the model has to predict the number written on an image, so the output of the model is a number. From all the m = 3599 data samples, the set of unique classes was found; in this case, the classes were 1, 2, 3 and 4.
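As a minimal sketch of these two steps in NumPy (the file name digits.csv is a placeholder, assuming a header-less CSV of the form described above):

import numpy as np

data = np.loadtxt("digits.csv", delimiter=",")   # raw data matrix: 3599 rows, 785 columns

Y = data[:, 0].reshape(-1, 1)   # expected output: the digit written on each image, shape (3599, 1)
X = data[:, 1:]                 # input: the 784 pixel values of each image, shape (3599, 784)

classes = np.unique(Y)          # unique classes found in the data: 1, 2, 3 and 4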

3. Splitting the raw data for training and testing

75% of the data samples were used for training (m = 2699) and the rest were used for testing (m = 900). Here, there are 784 features. As seen in the case of Linear Regression, the prediction model is

Eq(1): f(X) = X*Transpose (θ)

Here,

Eq(2): f(X) = θ0 + x1*θ1 + x2*θ2 + … + x784*θ784

where,

Eq(3): X = [1, x1, x2, … xn], Size(X) = 1 row, n+1 columns (n=784)

Eq(4): θ = [θ0, θ1, θ2, … θn], Size(θ) = 1 row, n+1 columns (n=784)
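Continuing the sketch, the 75/25 split and the bias term x0 = 1 could be added as follows (whether the rows were shuffled before splitting is not stated, so no shuffling is assumed here):

m = X.shape[0]                       # 3599 samples in total
m_train = int(0.75 * m)              # 2699 samples for training

X = np.hstack([np.ones((m, 1)), X])  # prepend the bias term x0 = 1, giving 785 columns

X_train, Y_train = X[:m_train], Y[:m_train]    # training set (2699 samples)
X_test,  Y_test  = X[m_train:], Y[m_train:]    # test set (900 samples)

theta = np.zeros((1, X.shape[1]))    # θ = [θ0, θ1, … θ784], one row, n+1 columns
f = X_train @ theta.T                # Eq(1): f(X) = X*Transpose(θ)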

4. Logistic function, g(z) — Sigmoid function

In the case of Linear Regression, the error cost function was

Eq(5): J = (1/2m)*(f(x)-y)²

In classification, however, the predicted output should represent the probability of a sample belonging to a class, i.e., a value between 0 and 1, whereas f(X) = X*Transpose(θ) can return any value. To squash the prediction into the range 0 to 1, the logistic function used here is the sigmoid function g(z).

Eq(6): g(z) = 1 / ( 1 + e^-z )

The sigmoid function was then coded as a helper function.
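A minimal NumPy sketch of such a sigmoid helper (the workshop’s actual code was shared as an image, so the exact form may have differed):

import numpy as np

def sigmoid(z):
    # Eq(6): g(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))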

An intermediate variable z can be defined as

Eq(7): z = X*Transpose(θ)

5. Error function — J(θ)

The ‘one versus all’ concept was used here: for each class, a binary classifier is trained to decide whether an image belongs to that class (“YES”) or not (“NO”). For example, for the classifier of class 2, a hand-written 2 is treated as “YES” and every other digit as “NO”.

Concept of comparing the expected output versus the predicted output
  • When the expected output is “YES” (Y=1), the error ranges from 0 to ∞ as the predicted output goes from 1 (YES) to 0 (NO). Mathematically, the error can be given as

Eq(8): J(θ) = -log(g(z))

when Expected output is “YES”

  • When the expected output is “NO” (Y=0), the error ranges from 0 to ∞ as the predicted output goes from 0 (NO) to 1 (YES). Mathematically, the error can be given as

Eq(9): J(θ) = -log(1 - g(z))

when Expected output is “NO”

The error cost function combines these two cases:

Eq(10): J(θ) = -log(g(z)) when the expected output is “YES” (y = 1); J(θ) = -log(1 - g(z)) when the expected output is “NO” (y = 0)

Generalising Eq(10) by taking Y into account,

Eq(11): J(θ) = y*( -log(g(z)) ) + (1-y)*( -log(1 - g(z)) )

Calculating the error for the dataset, taking all the m entries into consideration, the final error cost function is

Eq(12): J(θ) = (1/m) * Σ [ y*( -log(g(z)) ) + (1-y)*( -log(1 - g(z)) ) ], summed over all m data samples

Equation 12 is coded as the error cost function J(θ).
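A minimal NumPy sketch of such a cost function, reusing the sigmoid helper above (the function name cost is the sketch’s own choice, not necessarily the workshop’s):

def cost(theta, X, Y):
    # Eq(12): J(θ) = (1/m) * Σ [ y*(-log(g(z))) + (1-y)*(-log(1-g(z))) ], with z = X*Transpose(θ)
    m = X.shape[0]
    g = sigmoid(X @ theta.T)    # predicted probabilities, shape (m, 1)
    return -(1.0 / m) * np.sum(Y * np.log(g) + (1 - Y) * np.log(1 - g))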

6. Gradient

The gradient is

Eq(13): Gradient = ∂J(θ)/∂θ

Eq(14): Gradient = ( ∂J(θ)/∂z ) * ( ∂z/∂θ )

Eq(15): ∂J(θ)/∂z = g(z) - y  and  ∂z/∂θ = x

Eq(16): Gradient = x*(g(x*Transpose(θ)) - y)

Eq(17): Gradient = x*Error

Finding the gradient for all the m values and including the learning rate c, the final gradient becomes

Eq(18): Gradient = (c/m) * Σ [ x*(g(x*Transpose(θ)) - y) ], summed over all m data samples

For each class, the error cost is minimised using gradient descent, during which the 785 parameters are optimised and stored in the new_theta matrix, as sketched below.
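A minimal sketch of this one-versus-all training loop, continuing from the earlier arrays; the learning rate and the number of gradient-descent iterations are assumptions, not the workshop’s actual values:

c = 0.001                                                # learning rate (assumed value)
iterations = 1000                                        # gradient-descent steps (assumed value)

new_theta = np.zeros((len(classes), X_train.shape[1]))   # one row of 785 parameters per class

for k, cls in enumerate(classes):
    y_k = (Y_train == cls).astype(float)                 # 1 ("YES") for this class, 0 ("NO") otherwise
    theta = np.zeros((1, X_train.shape[1]))
    for _ in range(iterations):
        error = sigmoid(X_train @ theta.T) - y_k         # g(x*Transpose(θ)) - y for every sample
        gradient = (c / m_train) * (error.T @ X_train)   # Eq(18), shape (1, 785)
        theta = theta - gradient                         # move θ against the gradient
    new_theta[k] = theta.ravel()                         # store the optimised parameters for this class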

7. Real-time prediction

An image of pixel size 28X28 is taken as the input. If it is black-coloured text written on a white background, the image is inverted. The image is then converted to a row matrix x, and a bias term is inserted. For each class, the prediction output is given as

prediction = sigmoid(x*new_theta.T)   # new_theta.T is the transpose of new_theta

prediction represents the probability of a match between the number written in x and each class. The class with the maximum confidence (probability of matching) is chosen as the predicted output. The predicted results were compared with the original results to calculate the accuracy.
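Continuing the sketch, the prediction over the test set and the accuracy calculation could look like this (the variable names are the sketch’s own):

probabilities = sigmoid(X_test @ new_theta.T)           # one probability per class, shape (900, number of classes)
predicted = classes[np.argmax(probabilities, axis=1)]   # class with the maximum confidence for each test image

accuracy = np.mean(predicted.reshape(-1, 1) == Y_test) * 100
print("Accuracy on the test set:", round(accuracy, 2), "%")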

Challenges faced

  • Logistic regression is more complicated than Linear Regression. It became easier to code after first working out the math for the error function J(θ) and the gradient (∂J(θ)/∂θ).
  • If the image is coloured, it needs to be converted to grayscale and then to a binary (black and white) image, and it must be ensured that the character is white on a black background before it is passed to the prediction (a sketch of this preprocessing follows below).
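A minimal sketch of this preprocessing, assuming Pillow is used for the image handling; the threshold of 128, the inversion heuristic and the file name are illustrative assumptions, and sigmoid, new_theta and classes come from the earlier sketches:

from PIL import Image
import numpy as np

img = Image.open("digit.png").convert("L")     # load and convert to grayscale ("digit.png" is a placeholder)
img = img.resize((28, 28))                     # match the 28X28 training resolution
pixels = np.array(img, dtype=float)

pixels = np.where(pixels > 128, 255.0, 0.0)    # binarise to pure black and white (threshold assumed)

# If the character appears black on white (mostly bright image), invert it so the
# character is white (255) on a black (0) background, as in the training data.
if pixels.mean() > 127:
    pixels = 255.0 - pixels

x = pixels.reshape(1, -1)                      # row matrix of 784 pixel values
x = np.hstack([np.ones((1, 1)), x])            # insert the bias term

prediction = sigmoid(x @ new_theta.T)          # probability of matching each class
predicted_digit = int(classes[np.argmax(prediction)])
print("Predicted digit:", predicted_digit)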
