Training a Digit-Recognizing AI

Behind the Scenes of Training a Digit-Recognizing AI

Santiago Garcia
20 min read · Jun 13, 2023

Ever since the release of ChatGPT, the Artificial Intelligence (AI) industry has been booming. As a programmer, I took some interest in the technology and decided to get my feet wet by trying to build an AI myself.


I decided to follow a tutorial to make an AI that learns to recognize digits and tell them apart (e.g., if you show the AI an image of the number 7, it can tell you that it’s a 7).

Even though I did manage to make the AI, I didn’t understand much of what was going on. So, following my curiosity, I delved deeper and deconstructed the tutorial to gain a better understanding of the theory. This is what I got:

Here’s a quick graph I made to represent the steps in building this AI

Disclaimer — Even though the goal of this article is to explain in simple terms the theory behind the digit recognition AI, I strongly recommend having a minimum understanding of what neural networks are. To do so, watch this video up to minute 3:43.

What’s so cool about neural networks?

In case you didn’t watch the video, neural networks are basically sophisticated algorithms that mimic the human brain’s ability to learn and solve complex problems.

Let me say that again. Neural networks, just like humans, learn how to do certain tasks. Unlike programming, where you need to explicitly code for each potential scenario (such as if x occurs, perform y), you just tell neural networks what you want them to do, train them and they. will. learn. by themselves. Isn’t that amazing?? (Well, I definitely find it cool as hell HAHAHA 😄)

So, the same way you teach a child how to do addition, you also teach a neural network how to do a task (in this case, the task is recognizing digits). However, contrary to humans, who have many ways to learn, for example:

  • observing and imitating others,
  • following the instructions and explanations of others,
  • trial and error,
  • etc.

neural networks only learn with trial and error.

MNIST dataset

Okay, but how do they learn with trial and error? Simple! Data. They need information, i.e. data, to learn.

So since we’re trying to teach an AI how to recognize numbers in an image, we need to show it a lot of pictures of numbers and their corresponding labels, so it can start figuring out patterns for each number.

That’s pretty much what the MNIST dataset is about. It stands for Modified National Institute of Standards and Technology database, and in short, it’s just a very large database of handwritten digits.

Sample images from the MNIST test dataset. (image credit)

In the video, this step is marked by the line of code:

mnist = tf.keras.datasets.mnist

Separating the training and testing data

So the first part is loading the data from the MNIST database, but once we’re done with that we need to split the data so the AI can actually use it.

The concept here is to separate all the data into two bundles: training data and testing data. The common practice in the machine learning community is to split 70–80% of the total data into training data and 20–30% into testing data. (There can also be validation data, but we’re not going to get into it in this article.)

Example of how the data is separated. (image credit)

(By the way the percentages are out of the total amount of data we have. So 80% training data means taking 80% of the total amount of data, and putting it in the training data category)

Now you may be wondering: why? Well, we do this because in order to confirm whether the AI actually learnt to do the task we wanted it to do, we need to be able to show it data it has never seen before (i.e. the testing data that we set apart).
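If you’re curious what such a split looks like in code, here’s a minimal sketch on a made-up dataset (MNIST actually comes pre-split for us, as we’ll see in a moment, so we never have to do this by hand):

import numpy as np

# A made-up dataset of 100 samples, just to illustrate the 80/20 idea
data = np.arange(100)
np.random.shuffle(data)

split_point = int(len(data) * 0.8)   # 80% of the data goes to training
train_data = data[:split_point]      # the first 80 samples
test_data = data[split_point:]       # the remaining 20 samples

print(len(train_data), len(test_data))  # 80 20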

This stage is marked in the video by the line of code:

(x_train, y_train), (x_test, y_test) = mnist.load_data()

If you pay attention to the code, you can see that the data is split even further into (x, y) for both training and testing (e.g., there are x_test and y_test instead of just test).

We do this to separate the pictures from their corresponding labels. For example, if the first element of x_train is a picture of the number 7, then the first element of y_train is the number 7.

As you can see in this picture, the image of the number 5 is the x_train, while the “Label: 5” is the y_train (by the way, this is not 100% accurate because in the actual dataset the y_train doesn’t say “Label: 5”, it would just say “5”).
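To make this concrete, here’s a small sanity check you could run right after loading the data (the shapes shown are what the Keras MNIST loader actually gives you: 60,000 training images and 10,000 testing images, each 28x28 pixels):

print(x_train.shape)  # (60000, 28, 28) -> 60,000 training images of 28x28 pixels
print(y_train.shape)  # (60000,)        -> 60,000 matching labels
print(x_test.shape)   # (10000, 28, 28) -> 10,000 testing images
print(y_train[0])     # the label of the first training image, a single digit such as 5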

Normalization of the data

Once we’ve split the data into the training and testing datasets, we can now move on to normalizing the data. We’re in step #3 of the entire process, and in the last section of dealing with the data itself.

But what is normalization?

Basically, normalizing is the process of scaling down every value so they’re between 0 and 1.

Note: In a grayscale (black and white) image, the value of a pixel represents its brightness level. The minimum value of 0 represents black, while the maximum value (in this case 255) represents white. Every value in between is a shade of gray which gets darker as it gets closer to 0 and lighter as it gets closer to 255. This is important because the number images are in black and white.

For example, when normalizing the pixels in the number images, the white pixels would go from a value of 255 to 1, the black pixels would remain at 0, and everything in between would be shrunken down accordingly.

An example of normalization
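To get a feel for the arithmetic, here’s a tiny sketch of that rescaling done by hand, simply dividing by 255 (the tutorial itself uses a Keras helper for this, which we’ll look at in a moment, but the idea is the same: large pixel values get squashed into the 0 to 1 range):

import numpy as np

pixels = np.array([0, 64, 128, 255], dtype=np.float32)
scaled = pixels / 255.0    # squash every value into the 0-1 range
print(scaled)              # [0.   0.25  0.5   1.  ] (approximately)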

So once again, you may be asking: why? (It’s good to be curious, keep it up! 😂). Essentially, normalization is used to prevent some features of the image from having a greater influence on the learning process than others, which can make the model less accurate and slower to learn.

By normalizing the input features, we can ensure that each feature contributes equally to the learning process and that the model is more effective at generalizing to new data.

x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)

The code above is the one used in the tutorial for the normalization part. For any coder, it makes sense at a glance, but if you pay attention there’s something we haven’t seen before: the axis. So what is it exactly?

In short, the axis tells the function along which dimension of the array (the rows or the columns) the normalization is applied.

Since each image is made up of a 28x28 2D (2 dimensional) array, there are two axes which could be normalized: the rows or the columns. For example, look at this image:

The horizontal lines represent the rows, and the vertical lines represent the columns. The thing is, 2D arrays are built up as a combination of the type

[row][column]

in which the columns are nested INSIDE the rows.

For example, the 0 in the top left corner would be in the position [0][0]*, while the 0 to its right would be in the position [0][1], because it’s in the first row, second column. (This is very similar to matrices, for the math people lol.)

*(it’s [0][0] instead of [1][1] because arrays are zero-indexed, which means that we start counting at 0 instead of 1)

Now, the thing here is that axis #0 (the rows) doesn’t actually hold the pixel values, but rather another array (that’s the beauty and complexity of 2D arrays). The actual grayscale values live along axis #1 (the columns).

That’s why we pass axis=1 (normalizing along axis #1, the columns) instead of axis=0 (which would normalize along axis #0, the rows).
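If the axis idea still feels abstract, here’s a tiny NumPy sketch (using sum instead of normalize just because the numbers are easier to follow by hand) showing how the axis argument picks the direction an operation runs along:

import numpy as np

# A tiny 2x3 "image", just to illustrate rows, columns and axes
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr[0][1])        # 2 -> first row, second column (zero-indexed)
print(arr.sum(axis=0))  # [5 7 9] -> operates DOWN the rows, one result per column
print(arr.sum(axis=1))  # [6 15]  -> operates ACROSS the columns, one result per row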

Model

So, now that we’re done normalizing the data, what’s next? Exciting things my friend, as we’re finally getting into the real core of neural networks: the model!!

So, what are AI models?

Remember how I said that AI learns through trial and error? Well, AI models are like digital brains that are trained to do a certain task. These digital brains use algorithms (sets of instructions) to analyze data and make predictions or decisions based on that data.

The structure and behavior of an AI is defined by its model, which is directly correlated to the effectiveness of the neural network. The model includes information such as the

  • number of layers,
  • the number of neurons in each layer,
  • and the type of activation function used by each neuron.

In this case, we’re using

model = tf.keras.models.Sequential()

which is not a pre-existing model but rather lets us build our OWN neural network by stacking layers one after another! Interesting, is it not?! 😄 Let’s continue…

Layers

Since neural networks are composed of layers, we’ll want to understand them before we build our own customized AI model.

So if you want the technical definition: a layer refers to a group of neurons that process the same type of input. Each layer receives input data, processes it, and produces an output that is passed on to the next layer in the network. If you didn’t understand anything, that’s okay 😂. I’m going to explain:

Think of a neural network like a team of workers. Each worker has a specific job to do, and together they work on a bigger project. In a neural network, the workers are called neurons, and they work together to solve a problem or make a prediction.

Now, just like workers, not all neurons in a neural network do the same job. Some neurons process the input data, while others make predictions based on that data. That’s where layers come in.

A layer is a group of neurons that work together to perform a specific job in the neural network. Just like how a worker might specialize in a specific task, each layer in a neural network specializes in a specific type of computation. For example, one layer might look for patterns in the input data, while another layer might use those patterns to make a prediction.

In short, a layer in a neural network is a group of neurons that work together to perform a specific type of computation. Together, the layers in a neural network can solve complex problems and make accurate predictions. Cool, am I right?!

Which type of layer is used here? Part 1

So, let’s give the code a closer look:

model.add(tf.keras.layers.Flatten(input_shape=(28,28)))

It seems we’re using the Flatten layer, which is used to reshape the input data (the pictures) into a 1D array, since they’re originally a 2D array.

In this example, the input data has a shape of (28, 28), which represents a 2D image with 28 pixels along each dimension. The Flatten layer simply reshapes said input data into a 1D array with 784 (28 x 28) elements.

Example of input data (the only thing that’s missing is a drawing of a number lol)

So once again, why? Flattening doesn’t change the information in the image; it just lays the 784 pixel values out in a single row. We do this because the dense layers we’re going to use later in our neural network expect each sample to be a flat 1D array of values, not a 2D grid.

Example of flattening a 2 dimensional array into a 1 dimensional array
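Here’s a small sketch of what flattening does to a single (stand-in) image. Note that nothing is thrown away; the same 784 values are just laid out in one long row:

import numpy as np

image = np.arange(28 * 28).reshape(28, 28)  # a stand-in for one 28x28 image
flat = image.flatten()                      # the same 784 values, now in a single row

print(image.shape)  # (28, 28)
print(flat.shape)   # (784,)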

Which type of layer is used here? Part 2

After we’ve flattened our input data, we follow up with the lines of code

model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

As we can see, now we move on to the Dense layer type. So, what are dense layers?

A dense layer, also known as a fully connected layer, is the simplest and most common type of layer in a neural network.

In a dense layer, every neuron is connected to every neuron in the previous layer, AND the output from each neuron is used as input for every neuron in the next layer.

Here’s a visual example of what a dense layer looks like: as you can see, each neuron (circle) is simply connected to every neuron from the previous and next layers.

Visualization of dense layers. (image credit)

Dense layers are typically used for tasks such as classification, regression, and feature learning. They have the ability to understand and capture complicated patterns between different pieces of information, and they are really good at working with data that has many different aspects or characteristics.
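To make “fully connected” a little more concrete, here’s a toy sketch of the computation one dense layer performs (using a made-up 4-input, 3-neuron layer instead of our real 784-input, 128-neuron one): every input is multiplied by a separate weight for every neuron, the results are summed per neuron, and a bias is added.

import numpy as np

# A toy dense layer: 4 inputs -> 3 neurons (ours will really be 784 -> 128)
x = np.array([0.5, 0.1, 0.0, 0.9])  # input vector, e.g. flattened pixels
W = np.random.randn(4, 3)           # one weight for every (input, neuron) pair
b = np.zeros(3)                     # one bias per neuron

output = x @ W + b                  # every input contributes to every neuron
print(output.shape)                 # (3,) -> one value per neuron
# (an activation function is then applied to these values; more on that next)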

Does more layers = better AI?

This is a really interesting question I asked myself (as it would make things quite simple for us, since all we would need to do is make a lot of layers HAHAHA 😂), but here’s the answer: not necessarily.

The number of layers in a neural network is just one factor that can contribute to its overall performance. Adding more layers to a network can sometimes improve its ability to learn complex patterns in the data, but it can also increase the risk of overfitting, which means the model becomes too specialized on the training data and loses its ability to generalize to new, unseen data.

In some cases, a simpler network with fewer layers may perform better than a larger, more complex one. It’s also important to consider other factors such as the size and quality of the training data, the choice of activation functions and other hyperparameters, and the computational resources available for training the AI.

model.add(tf.keras.layers.Flatten(input_shape=(28,28)))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

But wait! We can now understand what the Flatten and Dense parts are in the code, but what does the activation part mean??

Activation functions

So technically, an activation function is a mathematical function that is applied to the output of a neural network node (neuron), to introduce non-linearity into the output. They determine whether the node should be activated or not, based on whether the input is above a certain threshold.

Wait! I know you’re probably thinking “okay, but what the heck does that mean?”😂, but before I lose your attention, allow me to explain!

Let’s go back to the workers:

Think of a neural network as a group of workers called neurons, which are arranged in layers (as previously covered). Each worker has a job to do, and they need to decide whether to keep working or take a break based on the amount of work they have done so far.

Now, think of an activation function as a supervisor who tells the worker whether to keep working or take a break. The activation function looks at the total amount of work the worker has done and makes a decision.

The activation function is important because it helps the neural network to learn complex relationships between inputs and outputs. By applying non-linear activation functions, the neural network can model more complex relationships between inputs and outputs.

For example, in a binary classification task (where the goal is to classify input data into one of two categories), the activation function of the last layer may be a sigmoid function that maps the output to a value between 0 and 1.

If the output is above a certain threshold (e.g. 0.5), the neuron will “fire” and output a signal indicating that the input belongs to one category, and if it is below the threshold, it will output a signal indicating the input belongs to the other category.
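Our digit model doesn’t use a sigmoid (it has 10 categories, not 2), but since it’s the easiest one to picture, here’s a minimal sketch of that thresholding idea:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))  # squashes any number into the 0-1 range

raw_output = 2.0                           # whatever the last neuron computed
probability = sigmoid(raw_output)          # about 0.88 here
category = 1 if probability > 0.5 else 0   # above the 0.5 threshold -> category 1
print(probability, category)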

Which type of activation function is used here? Part 1

Let’s get a better understanding of the activation functions used in our AI! ReLU (Rectified Linear Unit) is one of the most commonly used activation functions in deep learning. It is a simple function that returns the input value if it is positive and 0 if it is negative.

This is what a graph of it looks like:

Example of the ReLU function. (image credit)

The advantage of ReLU over other activation functions is that it is computationally efficient and allows for faster training of neural networks.
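The whole function fits in one line; here’s a minimal NumPy version of it:

import numpy as np

def relu(x):
    return np.maximum(0, x)  # keep positive values, turn negative ones into 0

print(relu(np.array([-3.0, -0.5, 0.0, 2.0, 7.1])))  # [0.  0.  0.  2.  7.1]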

Which type of activation function is used here? Part 2

The other activation function used in our model is Softmax. This activation function is commonly used in the output layer of a neural network when the problem is a multi-class classification problem, i.e., when there are more than two possible outcomes (like the digit recognition problem, in which the AI needs to choose out of 10 digits).

Basically, the Softmax function helps us convert a bunch of numbers into probabilities. It takes these numbers, does some calculations, and gives us a set of values that add up to 1. This lets us know the likelihood of each number being the right answer.

Here’s an example: let’s say we have a picture of a cat, a dog, and a bird, and we want the computer to tell us what animal is in the picture. We might use a neural network to analyze the picture and give us a list of numbers that represent how likely it is that the picture is of a cat, a dog, or a bird.

But just looking at those numbers might not be very helpful. We want to turn them into probabilities, so we can say something like “there’s a 70% chance that the picture is of a cat, a 20% chance that it’s a dog, and a 10% chance that it’s a bird.”

That’s where Softmax comes in. We can apply the softmax function to those numbers, and it will turn them into probabilities that add up to 1 (basically 100%).
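Here’s a small sketch of that idea (we never write this ourselves, since Keras has softmax built in, but it shows where the probabilities come from):

import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))  # subtracting the max keeps the numbers stable
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw outputs for, say, cat / dog / bird
probs = softmax(scores)
print(probs)        # roughly [0.66 0.24 0.10]
print(probs.sum())  # 1.0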

Compilation

Okay! We finished “building” the neural network, and now what’s left is to compile it! (We’re in step 7 of the process)

We compile the AI by using this line of code:

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics = ['accuracy'])

But holyyyy, wait a minute, there seems to be a lot going on in this code! Let’s break this down.

Optimizer

The first parameter in the compile line of code is regarding optimizers, but what are they?

Optimizers are algorithms used to update the weights and biases of a neural network during training in order to minimize the error or loss function. The goal of an optimizer is to find the optimal set of weights and biases that result in the lowest possible error on the training data.

Now unless you’re familiar with weights and biases (which are not covered in this article, sorry 😓) and loss functions (these will be covered later ! 😁), you’re probably super confused right now so let me explain what an optimizer does.

Let’s imagine you have a toy car that you want to drive as fast as possible along a track. The speed of the car depends on the adjustments you make to its various parts, such as the angle of the wheels or the power of the motor. Optimizers are like the tools you use to find the best adjustments to make the car go faster.

In this case, we can see that we’re using the Adam (Adaptive Moment Estimation) optimizer. This optimizer guides us to adjust parameters such as the weights and biases in the right direction, taking into account our previous adjustments.

Imagine the Adam optimizer as a baking assistant who helps you find the right amounts of ingredients to make the tastiest cookies. It not only tells you which direction to adjust the ingredient amounts but also learns from your previous adjustments to give you better recommendations.

Here’s how it works:

  1. First, you randomly pick ingredient amounts and make a batch of cookies. Then, you taste them and rate how good they are. This rating tells you how well the ingredient amounts worked.
  2. Now, the Adam optimizer steps in and does some calculations. It measures how changes in ingredient amounts affect the taste of the cookies. These measurements are called gradients. The optimizer pays attention to your ratings and figures out which ingredients had the most impact on the taste.
  3. Based on these gradients and your previous adjustments, the Adam optimizer suggests new ingredient amounts that are likely to make the cookies even tastier. It learns from your past adjustments and tries to give you smarter recommendations each time.
  4. You listen to the optimizer’s suggestions and adjust the ingredient amounts accordingly. Then, you bake another batch of cookies, taste them, and rate them once more. The process repeats.

Here’s the cool part: the Adam optimizer remembers how the ingredient amounts changed over time. It takes into account your ratings and the adjustments you made before. This way, it learns from each batch of cookies and helps you get closer to finding the best ingredient amounts for making the most delicious cookies. Or for our case, it finds the best way to recognize a digit!
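If you want to peek under the analogy, here’s a heavily simplified sketch of what any optimizer is doing at its core: nudging a parameter in whichever direction lowers the loss. (This is plain gradient descent on one made-up parameter, not Adam itself; Adam additionally remembers past adjustments, as described above.)

weight = 5.0          # some made-up parameter we want to tune
learning_rate = 0.1

for step in range(50):
    loss = (weight - 2.0) ** 2          # pretend the "best" value of the weight is 2.0
    gradient = 2 * (weight - 2.0)       # how the loss changes as the weight changes
    weight -= learning_rate * gradient  # nudge the weight to lower the loss

print(weight)  # ends up very close to 2.0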

Loss function

After optimizers, we also need to understand loss functions. In machine learning and neural networks, a loss function is a mathematical function that measures how well the model is performing. The goal of training a machine learning model is to minimize the loss function, which means making the predicted output as close as possible to the actual output.

For example, let’s say you’re trying to teach a computer to recognize handwritten numbers (which we are LOL). You show the computer an image of the number 5, and you want it to predict that it’s a 5. The loss function compares the computer’s prediction with the correct answer. If the computer predicts the number 3, the loss function will calculate a large distance because it’s far from the correct answer. But if the computer predicts the number 5, the loss function will calculate a small distance because it’s very close to the correct answer.

Which type of loss function is used here?

For our project we use the sparse_categorical_crossentropy loss function, which is used when we want a computer program to classify things into different categories. As a loss function, its task is to help the program learn and improve by telling it how good or bad its guesses are.

The special thing about this loss function is that it can handle categories that are represented by plain numbers instead of special codes (one-hot encoded vectors). It figures out how close the program’s guess is to the right answer and tries to make the guesses better over time. In short, it helps the program get smarter at classifying things.
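For one example, the math behind “how good was the guess” boils down to taking the negative logarithm of the probability the model assigned to the correct digit; the more confident the model was in the right answer, the smaller the loss. A quick sketch:

import numpy as np

# Suppose the softmax layer assigned these probabilities to the digits 0-9
probabilities = np.array([0.01, 0.02, 0.02, 0.05, 0.02, 0.70, 0.05, 0.05, 0.03, 0.05])
true_label = 5  # the image really is a 5 (just the number, no one-hot encoding needed)

loss = -np.log(probabilities[true_label])  # cross-entropy for this single example
print(loss)  # about 0.36 -> a confident, correct guess gives a small loss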

Metrics

What are metrics?

The last part to cover in the compilation are the metrics. In machine learning, metrics are used to evaluate the performance of a model. They are quantitative measures that are used to assess how well the model is performing on a given task. Metrics can be used to compare the performance of different models, or to monitor the performance of a single model over time.

Which type of metric is used here?

For our case, we’re using the accuracy metric, which is a commonly used metric to evaluate the performance of a classification model. It measures the proportion of correctly classified instances among all instances (so how many digits were correctly recognized by the AI).

For example, if a model classifies 90 out of 100 digits correctly, the accuracy is 90%.

Training the model

Once we’ve compiled our model, most of the work is done! Hurray! What’s left is to train the model on the data, and then we can use it to make predictions for us >:). (We’re in step 10!! We’re getting very near to the end)

Let’s continue studying the code:

model.fit(x_train, y_train, epochs=3)

What this is doing is training the neural network model on the training data for a specified number of epochs, which is 3 in this case. (An epoch is one complete pass through the entire training dataset during the training process. In this case, the AI goes through the training data 3 times since epochs=3.)

But what is going on during the training? During each epoch, the model will feed the training data through the neural network, calculate the loss, and adjust the model parameters to minimize the loss.

Basically, it shows the AI the pictures we split into the training category at the beginning, asks the AI what digit is in each picture, the AI gives its answer, and based on whether that answer is right or wrong, it adjusts the way it recognizes digits.

This process continues for the specified number of epochs, after which the trained model can be used to make predictions on new data.

Now you may be thinking, is it as simple as having the model go through more epochs to make it more effective? (Okay maybe you’re not wondering this, but I am so I’ll answer it anyways😂)

The answer is: not necessarily. Increasing the number of epochs can improve the model’s accuracy up to a certain point, but after that point, it can lead to overfitting, just like having a lot of layers.

That’s why it is important to find the right balance between the number of epochs and model performance. It’s a common practice to use early stopping, which monitors the validation loss and stops the training when it starts to increase, to prevent overfitting.
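The tutorial doesn’t use it, but for the curious, this is roughly what early stopping looks like in Keras (a sketch, assuming we’re happy to carve a small validation set out of the training data):

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

# validation_split sets aside 10% of the training data to measure val_loss on
model.fit(x_train, y_train, epochs=20, validation_split=0.1, callbacks=[early_stop])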

Evaluating

Once we’re done training our model, we need to do one very last thing before we can get to predictions (the fun part!) and that is evaluation! Fundamentally, the purpose of evaluation in machine learning is to assess the performance and effectiveness of a trained model on unseen data.

The code snippet loss, accuracy = model.evaluate(x_test, y_test) is used to evaluate the trained model's performance on the test dataset we had set apart at the beginning.

The evaluate method returns two values: the value of the loss function and the accuracy of the model on the test dataset.

print(loss)
print(accuracy)

The code then prints out the values of the loss and accuracy using the print function. This is useful for assessing how well the model performs on data it hasn't seen before, and can help identify if the model is overfitting or underfitting.

Predictions

Finally! We’re done with the boring technicalities of getting the AI model ready, and we can finally get to playing around with it 😁. In order to do so, we follow these lines of code:

import os
import cv2
import numpy as np
import matplotlib.pyplot as plt

image_number = 1
while os.path.isfile(f"Samples/Sample{image_number}.png"):
    try:
        # Read the image and keep only one channel (grayscale)
        img = cv2.imread(f"Samples/Sample{image_number}.png")[:,:,0]
        # Invert the colors and wrap the image in a batch of size 1
        img = np.invert(np.array([img]))
        prediction = model.predict(img)
        print(f"This digit is probably a {np.argmax(prediction)}")
        plt.imshow(img[0], cmap=plt.cm.binary)
        plt.show()
    except:
        print("Error!")
    finally:
        image_number += 1

This code is using the trained neural network model to predict the handwritten digit present in the images stored in the “Samples” folder.

It first initializes the variable “image_number” to 1 and then enters a while loop, checking if there exists a file named “Sample1.png” in the “Samples” folder, and if it exists, it proceeds with the following steps.

The code then reads the image using OpenCV and keeps only one channel (the grayscale channel). It then performs some preprocessing steps on the image, including inverting the colors and converting it to a numpy array.

The model’s “predict” function is then used to predict the digit in the image, and the result is printed along with the image itself!

Finally, the while loop increments the “image_number” variable and checks whether another image file with the incremented number exists in the “Samples” folder. If there is no more image file, the while loop terminates. Here’s an example of how it works!

If you would like to try out the model yourself, make sure to check out this tutorial I wrote!

To conclude

AI can be pretty daunting at first, but hopefully after reading this article you have a better understanding of what’s going on behind a digit recognition AI!

Even though it can be pretty complex, it doesn’t have to be, and the first step towards learning AI is to research and try out tutorials yourself!

Whether you’re here because you’re learning about AI, trying to support me or simply curious, I hope you’ve come out of this experience more knowledgeable, even if just by a bit :).

Thanks for reading, and see you next time!

P.S. if you enjoyed the article, you can follow me on here. You can also check out my newsletters here to keep up with me monthly.
