Image Classification using Logistic Regression on the American Sign Language MNIST

A simple image classification script using PyTorch

American Sign Language Alphabet

For starters, PyTorch is an open-source machine learning framework with interfaces for Python and C++. It is commonly used for deep learning and natural language processing. For its installation instructions, documentation, and tutorials, you may access the PyTorch website here.

Another term mentioned in the title is Logistic Regression. Logistic regression is a type of analysis used when the goal is to determine the category or class of an output. For example, whether an email is spam or not. In this case, logistic regression is used to determine the letter symbolized in the hand image.

About the Dataset

The American Sign Language MNIST Dataset used here is obtained from Kaggle. This dataset is much like the original MNIST dataset. Each training and test case consists of a numerical label (0–25) with a one-to-one correspondence to the English alphabet (0 corresponds to A) and a grayscale 28x28 pixel image with values ranging from 0–255. However, there are no labels corresponding to the letters J (9) and Z (25) due to the motion required to symbolize those letters. The numbers of training and testing cases in this dataset are much lower compared to the original MNIST dataset, since there are only 27,455 training cases and 7,172 test cases.

Sample Training Images

Let’s Get Started

This article assumes you have Anaconda installed on your system. If you don’t have it installed, you may view the instructions for installing Anaconda here.

What are these libraries?

  • NumPy and Pandas — for data handling and processing
  • PyTorch and Torchvision — for machine learning functions
  • Matplotlib — for data visualization
  • Jovian — for saving the code to the platform

After installing these libraries, we need to import them to our script.
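The imports might look like the following sketch (Jovian is commented out since it is only needed when saving to that platform):

```python
# Core libraries used throughout the script
import numpy as np                    # numerical arrays
import pandas as pd                   # CSV loading and data handling
import torch                          # tensors and machine learning functions
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, random_split
import matplotlib.pyplot as plt       # data visualization

# import jovian                       # only needed when saving to the Jovian platform
```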

Preparing the Data

In training this model, I ran the code on Kaggle which makes it easier for the program to run since we’re doing the processing on the cloud and the dataset itself is hosted in Kaggle.

First, we will define the hyperparameters of our model: the batch size and the learning rate. These hyperparameters control how the model learns from the training cases. The batch size dictates how many images the model loads at a time. Ideally, you would want to load all the data; however, limitations on the processor and working memory of the system prevent this, especially if the dataset is large. The learning rate, on the other hand, dictates how much the model adjusts its parameters after every training step. Setting these hyperparameters is somewhat arbitrary. In this case, the batch size was set to 256 and the learning rate to 1e-5.

Other constants are also defined, such as the input size and the number of classes. The input size is the amount of data needed to represent a single image. In this case, the input size is 784 since each image is a grayscale 28x28 pixel image (one value for each pixel coordinate). Since our outputs are categorical, the number of classes dictates how many categories there are. In this case, the number of classes is 26 since the labels are values from 0 to 25 inclusive.
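As a sketch, the hyperparameters and constants described above might be defined as follows (variable names are my own):

```python
# Hyperparameters (somewhat arbitrary choices, per the text)
batch_size = 256
learning_rate = 1e-5

# Other constants
input_size = 28 * 28   # 784 pixel values per grayscale 28x28 image
num_classes = 26       # labels 0-25, one per letter of the English alphabet
```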

After setting the hyperparameters, the training and testing datasets are then loaded. A dictionary is also created that maps each numerical label to its corresponding English alphabet equivalent.
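A minimal sketch of this step is shown below; the CSV paths are assumptions based on the Kaggle dataset layout, and the reads are commented out so the snippet runs anywhere:

```python
import string

# Paths are assumptions based on how Kaggle hosts the dataset
train_csv = "../input/sign-language-mnist/sign_mnist_train.csv"
test_csv = "../input/sign-language-mnist/sign_mnist_test.csv"

# Dictionary mapping each numerical label (0-25) to its alphabet equivalent
label_map = {i: letter for i, letter in enumerate(string.ascii_uppercase)}

# On Kaggle, the CSVs would be loaded with pandas:
# train_df = pd.read_csv(train_csv)
# test_df = pd.read_csv(test_csv)
```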

Exploring the Data

The first few rows of the training dataset are shown below.

We need to separate the pixel values and the labels from each other so we can load and access them separately. A function was constructed to split the training and testing datasets, separating the labels from the pixel values.
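A sketch of such a function, demonstrated on a tiny synthetic frame standing in for the real CSV (the function name is my own):

```python
import numpy as np
import pandas as pd

def dataframe_to_arrays(df):
    """Split a Sign Language MNIST dataframe into pixel values and labels."""
    labels = df["label"].to_numpy()
    pixels = df.drop("label", axis=1).to_numpy()
    return pixels, labels

# Tiny synthetic stand-in: 2 rows, a label column plus 784 pixel columns
demo = pd.DataFrame(
    np.random.randint(0, 256, size=(2, 785)),
    columns=["label"] + [f"pixel{i}" for i in range(784)],
)
inputs, labels = dataframe_to_arrays(demo)
print(inputs.shape, labels.shape)  # (2, 784) (2,)
```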

Let’s look at the first row of the training dataset. We also need to reshape the array to (28x28) since the initial shape is just a row array.
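The reshape step can be sketched like this (a random row stands in for the first training row, since the real data is not at hand):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

# Stand-in for a single row of 784 pixel values
row = np.random.randint(0, 256, size=784)

image = row.reshape(28, 28)    # from a flat row array to a 28x28 grid
plt.imshow(image, cmap="gray") # render as a grayscale image
# plt.show()                   # uncomment in an interactive session
```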

As expected, the letter in the hand image is D. However, it is evident that the image is not clear due to its small resolution. This may affect the accuracy of our model and its implementation at a much larger scale.

The training and testing input arrays are converted to continuous float values since these allow the model to learn more precisely than discrete values. The training and testing labels, on the other hand, are converted to long integers since the outputs of the model are indices used to access probability values.
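The conversions can be sketched as follows (random stand-in arrays with the same shapes as the real data):

```python
import numpy as np
import torch

# Stand-in arrays shaped like the real data
train_inputs = np.random.randint(0, 256, size=(4, 784))
train_labels = np.random.randint(0, 26, size=(4,))

# Pixel values as float32 for precise gradient-based learning;
# labels as long (int64) because cross_entropy expects class indices
inputs_t = torch.tensor(train_inputs, dtype=torch.float32)
labels_t = torch.tensor(train_labels, dtype=torch.long)
```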

Dataset and Data Loaders

In this part, the input arrays and label arrays are wrapped in a single object for both the training and testing datasets. The training dataset was also split into training and validation subsets. In this case, 15% of the original training dataset was placed in the validation subset.

After splitting the datasets, the data are loaded in batches of the size defined earlier. The batches for the training dataset are also shuffled since there is a chance that the elements in a batch are homogeneous (i.e., all from one category), which might lead to an inaccurate model. Since the validation dataset is only used to determine the accuracy of the model, shuffling it is optional.
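The wrapping, splitting, and batch loading described above can be sketched with PyTorch's TensorDataset, random_split, and DataLoader (random tensors stand in for the real data):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

batch_size = 256

# Stand-in tensors shaped like the real training data
inputs = torch.randn(1000, 784)
labels = torch.randint(0, 26, (1000,))

# Wrap inputs and labels in a single dataset object
dataset = TensorDataset(inputs, labels)

# Hold out 15% of the training data for validation
val_size = int(0.15 * len(dataset))
train_ds, val_ds = random_split(dataset, [len(dataset) - val_size, val_size])

# Shuffle the training batches; validation order does not matter
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
```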

Let’s look at the first entry of the training dataset.

Since this image is different from the image we saw earlier, we have some assurance that the datasets were randomly split and shuffled.

Constructing the Model

In this part, PyTorch’s built-in linear function was used. A linear model works by multiplying each independent variable by a weight and adding an offset bias; together these determine the classification of the output. In this case, the independent variables are the pixel values and the output is the letter classification.

In this case, PyTorch’s linear function will output a tensor with 26 elements, each element denoting the probability that the image is that symbol. For example, if the output of the function is [0.0001, 0.003, …, 0.001], then there is a 0.0001 chance that the letter in the hand image is A. The numbers in the label array correspond to the index of the element with the maximum value. However, the linear function by itself will not return probability values.

Initially, the function returns a tensor with 26 elements with values ranging from negative infinity to positive infinity. Well, technically not infinite since the values are limited by the number of bits allocated in the memory. As such, the values must be normalized between 0 and 1.0 inclusive.

Here, I used PyTorch’s cross_entropy function, which combines the softmax function (to normalize the resulting values from the linear function) with the negative log likelihood loss. A mathematical description of how the cross_entropy function works is shown below.

Cross Entropy Loss Equation, obtained from the PyTorch documentation:

loss(x, class) = -log( exp(x[class]) / Σⱼ exp(x[j]) )

The softmax function is the term inside the negative logarithm. Each normalized value of the tensor is obtained by taking the exponential of that value, exp(x[class]), and dividing it by the sum of the exponentials of all the original tensor values.
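A quick sketch of this relationship: softmax maps arbitrary real values to probabilities that sum to 1, and cross_entropy is the negative log of the target class probability.

```python
import torch
import torch.nn.functional as F

# A single output row from a linear layer (arbitrary real values)
logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])

# softmax normalizes the values into probabilities summing to 1
probs = F.softmax(logits, dim=1)

# cross_entropy equals the negative log of the target class probability
loss = F.cross_entropy(logits, target)
```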

The code for constructing the image classification model is shown below.
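A sketch reconstructing that model class is given below. The class name is my own; the method names follow those discussed in this article (training_step, validation_step, validation_epoch_end, epoch_end).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

input_size, num_classes = 28 * 28, 26

class SignLanguageModel(nn.Module):
    """Logistic regression model: a single linear layer plus cross entropy."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, xb):
        xb = xb.reshape(-1, input_size)          # flatten each 28x28 image
        return self.linear(xb)

    def training_step(self, batch):
        images, labels = batch
        return F.cross_entropy(self(images), labels)

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        preds = torch.argmax(out, dim=1)         # highest value = prediction
        acc = (preds == labels).float().mean()
        return {"val_loss": F.cross_entropy(out, labels), "val_acc": acc}

    def validation_epoch_end(self, outputs):
        loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        acc = torch.stack([x["val_acc"] for x in outputs]).mean()
        return {"val_loss": loss.item(), "val_acc": acc.item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result["val_loss"], result["val_acc"]))

model = SignLanguageModel()
out = model(torch.randn(4, 28, 28))   # a batch of 4 random stand-in images
```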

Upon instantiating the class, random weight and bias values are created; hence the need for the training_step and validation_step methods. The validation_epoch_end and epoch_end methods, on the other hand, are there to present the current performance of our model.

Training the Model

Initially, the model will have a low accuracy since it contains random parameter values. Since we’re dealing with a probability distribution, we will take the index of the highest value as the prediction of our model. Here we define the accuracy of the model as the number of times the model correctly predicts the image divided by the total number of input images.
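A sketch of that accuracy definition, with a small worked example:

```python
import torch

def accuracy(outputs, labels):
    # Take the index of the highest value in each row as the prediction,
    # then divide the number of correct predictions by the total
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

# Example: three output rows; the first two predictions match the labels
outputs = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = torch.tensor([1, 0, 0])
acc = accuracy(outputs, labels)   # 2 of 3 correct
```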

The bulk of the training occurs by feeding our batches of data to the model. The accuracy function cannot help us in determining how the model should improve since it only looks at the final output. As such, we will use Stochastic Gradient Descent to gradually improve the model by minimizing the loss.

Linear regression works by fitting a curve that represents the relationship between the data points. Similarly, the loss function is also a curve. We aim to improve the model by gradually descending toward a local minimum of the loss function, minimizing the loss. At that point, the slope is zero and the model is fairly stable. Most of the time, upon instantiating the model, the point on the loss curve corresponding to the model’s performance is somewhere above the local minimum. Hence, there is a stochastic descent toward the point where, ideally, the gradient will be 0.

The rate of adjustment from the initial parameters to the new parameters is dictated by the learning rate. Earlier, the learning rate was defined to be 1e-5.

It is important to call the zero_grad() function after every adjustment of the model parameters in order to reset the gradient values for the loss function.
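The training loop described above, with zero_grad() resetting the gradients after every parameter adjustment, can be sketched as follows (a tiny random dataset stands in for the real one, and the function name is my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

def fit(epochs, lr, model, train_loader):
    """Minimal stochastic gradient descent training loop."""
    optimizer = torch.optim.SGD(model.parameters(), lr)
    losses = []
    for epoch in range(epochs):
        for images, labels in train_loader:
            loss = F.cross_entropy(model(images), labels)
            loss.backward()        # compute gradients of the loss
            optimizer.step()       # adjust parameters along the gradient
            optimizer.zero_grad()  # reset gradients for the next batch
        losses.append(loss.item())
    return losses

# Tiny synthetic demo: 64 random "images", 26 classes
model = nn.Linear(784, 26)
ds = TensorDataset(torch.randn(64, 784), torch.randint(0, 26, (64,)))
loader = DataLoader(ds, batch_size=32, shuffle=True)
history = fit(epochs=3, lr=1e-5, model=model, train_loader=loader)
```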

Let’s evaluate our model.

Since the model parameters are random upon instantiation, it is expected that the accuracy is very low and the validation loss is high. The next step is training the model. In this training phase, the model was trained for 50 epochs (50 full passes through the training data) with a learning rate of 1e-4.

It would be much easier if we look at how the model performs using graphs.
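A sketch of such a plot, assuming a `history` list of per-epoch results like the one returned by the training loop (the values here are stand-ins):

```python
import matplotlib
matplotlib.use("Agg")              # non-interactive backend
import matplotlib.pyplot as plt

# Stand-in for the per-epoch validation results gathered during training
history = [{"val_acc": a} for a in (0.10, 0.30, 0.45, 0.52)]

accuracies = [r["val_acc"] for r in history]
plt.plot(accuracies, "-x")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.title("Accuracy vs. No. of epochs")
# plt.show()                       # uncomment in an interactive session
```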

As seen from the graph, the accuracy of the model increases as the number of epochs increases. However, it is worth noting that the model was somewhat unstable, as there are dips in its accuracy. This could be attributed to a learning rate that is too large. Let’s try to evaluate the model using the testing dataset.

The accuracy of the model is just above 50%. Let’s try creating a new model but train it with a smaller learning rate.

Like the prior model, it has a low accuracy and a high validation loss since it only contains random model parameters. The model was then trained for 50 epochs with a learning rate of 1e-5.

Let’s visualize how the model performed using the same method as above.

The second model did not suffer the strong dips in accuracy that the first model did. However, it also did not reach as high a training accuracy as the first model. Let’s try training it for 50 more epochs.

Let’s look at how the second model performed across 100 epochs.

From the graph, the second model somewhat plateaus at about 80%. Let’s then evaluate this model on the testing dataset.

The second model did somewhat better than the first model, but not by much. The major difference between the models is their validation losses. This means that the second model returns a better probability distribution than the first. However, since we only take the class with the highest probability, a better probability distribution is not necessarily reflected in the accuracy.

Using the Model to Predict Images

A helper function is created to pass an image to the model and return the model prediction. The code for this function is shown below.
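A sketch reconstructing such a helper (the function name follows the common PyTorch tutorial convention; the untrained linear model here is only a stand-in):

```python
import string
import torch
import torch.nn as nn

label_map = {i: letter for i, letter in enumerate(string.ascii_uppercase)}

def predict_image(img, model):
    """Return the letter the model predicts for a single 28x28 image."""
    xb = img.reshape(1, 784)            # add a batch dimension
    out = model(xb)
    _, pred = torch.max(out, dim=1)     # index of the highest value
    return label_map[pred.item()]

# Demo with an untrained linear model and a random stand-in image
model = nn.Linear(784, 26)
letter = predict_image(torch.randn(28, 28), model)
```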

Let’s try running the function.

In this example, the model correctly predicts the letter portrayed by the hand image. Let’s try another one.

Here the model incorrectly predicts the image. One probable reason for this is the concentration of dark pixels in the middle, similar to how one would symbolize “N” in American Sign Language.

Saving the Model

After constructing and training the model, it is important to save the model parameters in order to use the model in future work. That way, time and processing power can be saved since the model no longer requires training. The code for saving the model is shown below.
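A sketch of the save-and-restore step (an untrained linear model stands in for the trained one):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 26)   # stand-in for the trained model

# Save only the learned parameters (weights and biases)
torch.save(model.state_dict(), "mnist-logistic.pth")

# Restore into a freshly constructed model of the same shape
model2 = nn.Linear(784, 26)
model2.load_state_dict(torch.load("mnist-logistic.pth"))
```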

The model parameters are saved as ‘mnist-logistic.pth’ in the same directory as the script. The saved file will not be there if the working directory was changed during the run of the script. Let’s look at the model parameters.


Conclusion

Logistic regression was used to construct a model with accuracy just above 50%. From the trend of the model’s performance, increasing the number of training epochs also increases the training accuracy. However, the training accuracy somewhat plateaus at about 80–90%.

This may be a limitation of constructing the model using just logistic regression. Other, more sophisticated models could result in higher accuracy.


References

American Sign Language MNIST Dataset:

Image Classification using Logistic Regression:

PyTorch Neural Network Functions: