MNIST: Exploration to Execution.

Madhu Sanjeevi ( Mady )
Published in dataDL.ai
Dec 10, 2019 · 12 min read

Hello all! This is my first story in this publication, and I wanna make it as useful as possible.

So in this story I am gonna take the most famous dataset in the ML community, MNIST, explore it as much as possible, and finally build good models and draw conclusions.

Note: The real reading time of this story is way more than what Medium says (and what you might think), so if you are serious about learning, you gotta give it that time.

About MNIST.

→ It consists of 28×28 grayscale images of handwritten digits.

→ It’s probably one of the first datasets used to prove the effectiveness of the algorithms and ideas behind the neural networks we build.

→ It contains 60,000 training images and 10,000 testing images, used for the task of image classification.

→ It’s probably the cleanest dataset you will ever find on the internet for machine/deep learning models, as it has a good bias-variance balance.

→ The current best error rate is ~0.21%, achieved using convolutional neural networks with data augmentation.

Hardware and software used.

Ubuntu (Linux), GPU (RTX 2080 Ti), CPU (AMD Ryzen), 32 GB RAM, CUDA 10, Python, PyTorch, NumPy, Matplotlib, scikit-learn, and Jupyter Notebook.

Outline.

  1. Understanding the stats/distribution of the dataset.
  2. Dimensional reduction visualization.
  3. Best model finding/fine-tuning.
  4. Optimizer comparisons on the dataset.
  5. Understanding the trained weights distribution.
  6. Trained model gradient visualization.
  7. Visualizing the trained hidden layers.
  8. GAN training.
  9. Transfer learning on MNIST.

Style of Explanation.

  1. All the Jupyter notebook code is available on my GitHub; here I attach images of the code snippets (I love looking at the code and the output).
  2. I skip attaching/explaining the less important code, which is available on GitHub anyway.
  3. I assume that readers already know the ML/DL vocabulary, some of the concepts, and the math.
  4. Understanding the intuition is more important than understanding the code/logic (the why over the how).
Let’s roll.

Let’s first load and see the data (t as torch, tv as torchvision).
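Since the full notebook lives on GitHub, here is only a minimal sketch of what this loading step can look like (the [0, 1] scaling and the batch size of 32 match the text; the paths and plotting details are my assumptions):

```python
# Minimal sketch of loading MNIST and plotting a grid of samples.
import torch as t
import torchvision as tv
import matplotlib.pyplot as plt

transform = tv.transforms.ToTensor()  # scales pixel intensities to [0, 1]
train_set = tv.datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_set = tv.datasets.MNIST(root="./data", train=False, download=True, transform=transform)

train_loader = t.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = t.utils.data.DataLoader(test_set, batch_size=32, shuffle=False)

images, labels = next(iter(train_loader))  # one batch of 32 images
fig, axes = plt.subplots(5, 5, figsize=(6, 6))
for ax, img, lbl in zip(axes.flat, images[:25], labels[:25]):
    ax.imshow(img.squeeze().numpy(), cmap="gray")
    ax.set_title(f"{int(lbl)}")
    ax.axis("off")
plt.tight_layout()
plt.show()
```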

Here I plotted a sample of 25 images from a batch of 32 (train_loader iterates over batches of images).

The data has been normalized between 0 and 1 (pixel intensity).

Understanding the stats/distribution of the dataset.

I took all the images as one big NumPy array and calculated the mean image for each class to get some sense of the data.
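A rough sketch of that computation, reusing train_set from the loading sketch above (the .data/.targets attributes assume a recent torchvision):

```python
# Sketch: class-wise mean images.
import numpy as np
import matplotlib.pyplot as plt

X = train_set.data.numpy() / 255.0  # shape (60000, 28, 28), scaled to [0, 1]
y = train_set.targets.numpy()       # shape (60000,)

mean_images = np.stack([X[y == c].mean(axis=0) for c in range(10)])  # (10, 28, 28)

fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for c, ax in enumerate(axes):
    ax.imshow(mean_images[c], cmap="gray")
    ax.set_title(f"{c}")
    ax.axis("off")
plt.show()
```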

here are the results

The mean images look really good, which shows that the data does not have a lot of noise and is crystal clear for DL models.

let’s understand the data/pixel distribution,

I plot the histogram for each class (counting the pixel values over all images of that class).
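A sketch of that plot, reusing X and y from the previous snippet (the bin count and grid layout are my choices):

```python
# Sketch: per-class histogram of pixel values.
fig, axes = plt.subplots(2, 5, figsize=(16, 6), sharey=True)
for c, ax in enumerate(axes.flat):
    pixels = X[y == c].ravel()      # every pixel value of every image of class c
    ax.hist(pixels, bins=50, range=(0, 1))
    ax.set_title(f"class {c}")
plt.show()
```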

X: pixel values (0–1), Y: number of pixels

As you see, most pixel values are ZERO (black), some are ONE (white), and a few lie between 0 and 1, for all the classes.

If we plot the distribution of the mean images, it looks like this (class 1 has the fewest white pixels, i.e. the most dark pixels) #ofcourse.

X: classes (0–9), Y: the mean pixel value for each class

Understanding these stats and distribution is important when you want to do feature engineering/scaling for a dataset.

For example, here every class has a different mean, so you can either consider the “mean” as another feature or zero-center the entire data and feed it to the models without the mean feature. #uptoyou #depends

Dimensional Reduction Visualization.

The whole point of dimensionality reduction techniques is to convert high-dimensional data to low-dimensional data effectively (without losing too much of the information in the data).

They are good for data visualization, feature selection, and feature engineering.

Here the input X has 28×28 pixels (a 784-dimensional vector), so let’s apply PCA.

It’s a linear dimensionality reduction technique which projects the data onto the directions of greatest variance (keeping 2 or 3 components out of 784).
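A sketch with scikit-learn’s PCA, reusing X and y from above (the plotting details are my choices):

```python
# Sketch: PCA down to 2 components with scikit-learn.
from sklearn.decomposition import PCA

X_flat = X.reshape(len(X), -1)      # (60000, 784)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_flat)   # (60000, 2)

plt.figure(figsize=(8, 8))
sc = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="tab10", s=2)
plt.colorbar(sc, ticks=range(10))
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```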

As you can see here, the data/classes are split into 10 different clusters/groups.

While the visualization looks good even with just 2 dimensions, it’s not enough to separate the classes cleanly. Luckily we have another technique called t-SNE, which is a non-linear dimensionality reduction technique and a probabilistic approach, unlike PCA, which is a purely mathematical (linear) approach.

t-SNE requires a lot of computation, so it takes a lot of time (minutes to hours) compared to PCA (seconds to minutes). Here I used Multicore t-SNE, which took around 10–15 minutes, instead of scikit-learn’s t-SNE, which seemed to take forever.

Note: The recent RAPIDS cuML GPU implementation of t-SNE takes only a few seconds for this job.
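A sketch of the Multicore t-SNE call, reusing X_flat and y from above (it assumes the MulticoreTSNE package; n_jobs and perplexity are my choices, not from the post):

```python
# Sketch: 2-D embedding with Multicore t-SNE.
from MulticoreTSNE import MulticoreTSNE as TSNE

tsne = TSNE(n_components=2, perplexity=30, n_jobs=8)
embeddings = tsne.fit_transform(X_flat)   # (60000, 2); the "embeddings" variable referenced below

plt.figure(figsize=(8, 8))
sc = plt.scatter(embeddings[:, 0], embeddings[:, 1], c=y, cmap="tab10", s=2)
plt.colorbar(sc, ticks=range(10))
plt.show()
```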

Here we can clearly see that the classes are separated well.

The variable embeddings holds the 2-D vectors corresponding to all the 28×28 MNIST images.

here I plot the original data and t-SNE embeddings.

784-d vs 2-d distribution.

So you can see the data distribution of the original data and of the t-SNE embeddings (axis 0 and axis 1).

Let’s plot each class’s embeddings.

If you look closely, classes 3 and 8 have a tough time separating, while the others are pretty well separated, especially 0, 1, and 6.

Best Model finding/fine tuning.

This is probably the most interesting and most important step for DL practitioners.

Although there is no particular recipe, there are some things that work well (and since this field is progressing very quickly, new things keep coming along to wipe out the old tricks).

Rule 1: Everything depends on the “Data” that you have.

Rule 2: Sometimes depends on the cool tricks and algorithms.

The way machine learning works is as follows

The data gets multiplied by (and added to) weights from some arbitrary n-dimensional vector space in order to find a solution space where a good X-to-Y mapping is achieved.

The loss, the optimization, the processing: everything depends on the numbers (data) that we have, so having a good dataset is super important.

Since we have the cleanest of data, let’s create a simple model for the classification task (as you might know, CNNs work well for image tasks, so I take only those).

A 2 layer model (Conv+FC)

I just took a 2-layer network (Conv + FC) with lr = 0.01 and momentum = 0.5 (generally this learning rate works well with SGD).
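A sketch of such a network follows; the lr and momentum values come from the text, while the filter count, kernel size, and the log-softmax/NLL pairing are my assumptions:

```python
# Sketch of the 2-layer model (Conv + FC) and its SGD optimizer.
import torch.nn as nn
import torch.nn.functional as F

class FirstModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)  # 1x28x28 -> 32x26x26
        self.fc1 = nn.Linear(32 * 26 * 26, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = x.view(x.size(0), -1)                     # flatten
        return F.log_softmax(self.fc1(x), dim=1)

model = FirstModel()
optimizer = t.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.NLLLoss()                              # pairs with log_softmax
```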

Let’s train it.

As you can see, after 10 epochs the train accuracy reaches 99%, because the data makes it pretty easy for the model to generalize/separate the classes.

Attention: It is not about fitting the data, it’s all about generalization. Also, more data requires bigger networks, and bigger networks require more data.

Let’s add one more conv layer and train it.
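The later sections describe this network (SecondModel) as having 2 conv and 2 FC layers, with conv outputs of shape (32, 26, 26) and (64, 24, 24), so a sketch consistent with that could look like this (the FC sizes are my assumptions):

```python
# Sketch of the bigger network, consistent with the shapes mentioned later in the post.
class SecondModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)   # -> 32x26x26
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)  # -> 64x24x24
        self.fc1 = nn.Linear(64 * 24 * 24, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)
```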

As you see, the accuracies improved a bit by adding another conv layer.

Optimizer comparisons on the dataset.

Above I used SGD as the optimizer; let’s try other optimizers with the same network and the same training procedure.

I took the same model, used a different optimizer for each copy of the network, and trained them all with the same procedure.
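A sketch of that comparison loop (the exact optimizer list and hyperparameters are my assumptions; SecondModel, criterion, and the loaders come from the sketches above):

```python
# Sketch: train identical networks with different optimizers and compare test accuracy.
optimizers = {
    "SGD": lambda p: t.optim.SGD(p, lr=0.01, momentum=0.5),
    "Adam": lambda p: t.optim.Adam(p, lr=0.001),
    "RMSprop": lambda p: t.optim.RMSprop(p, lr=0.001),
    "Adagrad": lambda p: t.optim.Adagrad(p, lr=0.01),
}

results = {}
for name, make_opt in optimizers.items():
    model = SecondModel()
    optimizer = make_opt(model.parameters())
    for epoch in range(10):                        # same training procedure for all
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    correct = 0                                    # final-epoch test accuracy
    with t.no_grad():
        for images, labels in test_loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
    results[name] = correct / len(test_set)
```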

Final Epoch results

RMSprop is the clear winner in this race, so we can take the same network as before and use RMSprop as its optimizer.

Since this dataset is super easy even for smaller networks, let’s stop the search for better models here and focus on the trained models.

Understanding the weights & gradients

Let’s understand how weights and gradients are changing during the training of the current best model.

I took the same network (SecondModel()), which has 2 conv and 2 FC layers, with RMSprop as the optimizer since it performed well, and ran it for 5 epochs.

During training, I take the weights, plot them as histograms, and save the plots as images.
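A sketch of how those histograms can be saved after each epoch (the layer names follow the SecondModel sketch; the file naming is my assumption):

```python
# Sketch: save per-layer weight histograms after each epoch.
def save_weight_histograms(model, epoch):
    layers = {"conv1": model.conv1, "conv2": model.conv2,
              "fc1": model.fc1, "fc2": model.fc2}
    fig, axes = plt.subplots(1, 4, figsize=(16, 3))
    for ax, (name, layer) in zip(axes, layers.items()):
        ax.hist(layer.weight.detach().cpu().numpy().ravel(), bins=100)
        ax.set_title(f"{name}, epoch {epoch}")
    fig.savefig(f"weights_epoch_{epoch}.png")
    plt.close(fig)
```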

The GIF below shows how the network’s weight distributions change during training for all the layers.

The initial uniform distribution of weights slowly transforms into a roughly normal distribution during training. Also observe that FC1 has a lot of neurons, so most of its weights stay very close to zero.

Let’s also save the gradients and plot them.

I save the gradients after loss.backward() is run by calling a save_gradients function, and after training I plotted them. Here is how the gradients flow from the last layer to the first layer.
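The post only names the save_gradients function, so this is just one possible sketch of it (recording the mean absolute gradient per weight tensor is my choice):

```python
# Sketch of a save_gradients helper (body is my assumption).
gradient_history = []

def save_gradients(model):
    grads = {}
    for name, param in model.named_parameters():
        if param.grad is not None and "weight" in name:
            grads[name] = param.grad.detach().abs().mean().item()
    gradient_history.append(grads)

# inside the training loop:
#   loss.backward()
#   save_gradients(model)
#   optimizer.step()
```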

As you can see, the gradients have mostly vanished at FC1, but they still back-propagate well to conv1.

Below is an example of good and bad gradient flow.

Alright! Now let’s visualize the trained weights.

Let’s visualize the first and last layers to see how they look.

The weight values range from negative (black) to positive (bright yellow), for example.

I plotted only 30 filters. Clearly we can’t interpret these images ourselves, but apparently the model understands them.

Each filter looks for a specific pattern/color in images (e.g. how much redness/blueness there is in a region).

Since the dataset has grayscale images, these weights don’t show obvious patterns; if we took color images, we would have seen those patterns.

Let’s plot the last layer.

Alright! Now let’s take an image, pass it through the network, and visualize all the layers.
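Here is a sketch of that pass, assuming model is the trained SecondModel from earlier (grabbing the intermediate outputs directly; forward hooks would work just as well):

```python
# Sketch: push one test image through the network and collect each layer's output.
image, label = test_set[0]
x = image.unsqueeze(0)                              # shape (1, 1, 28, 28)

with t.no_grad():
    conv1_out = F.relu(model.conv1(x))              # (1, 32, 26, 26)
    conv2_out = F.relu(model.conv2(conv1_out))      # (1, 64, 24, 24)
    fc1_out = F.relu(model.fc1(conv2_out.view(1, -1)))
    fc2_out = F.log_softmax(model.fc2(fc1_out), dim=1)

fig, axes = plt.subplots(4, 8, figsize=(12, 6))     # the 32 conv1 feature maps
for i, ax in enumerate(axes.flat):
    ax.imshow(conv1_out[0, i].numpy(), cmap="gray")
    ax.axis("off")
plt.show()
```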

Plotting feature maps of Conv1 layer

This looks like applying different filters to the input (low light, low color, high brightness, etc.).

It is easy to get some intuition about what this layer is doing, unlike the deeper layers: as the input passes through each layer, it gets smaller and smaller, so it’s difficult to visualize/understand the later ones.

Plotting feature maps of Conv2 layer

As you can see, the conv1 layer’s output shape is (32, 26, 26) and the conv2 layer’s output shape is (64, 24, 24).

Plotting the FC1 layer
Plotting the FC2 layer

As you can see, it’s giving a high probability to class 2.

Grayscale visualization.
Model working on Weights.

Machine learning only speaks one language, the language of weights: during training, the model learns weights that hold a knowledge representation of the data.

A few years ago, researchers figured out that this kind of knowledge can be transferred from one model to another.

They called it transfer learning.

Since this dataset is very clean, applying transfer learning will not boost the model’s performance as much as it usually does, but it definitely improves it.

Let’s try it.

I took a pretrained VGG16 model, which has been trained on ImageNet, so its final layer has 1000 classes.

There are two parts here: the first acts as the feature extractor and the second is the classifier. Let’s replace the last layer with one that has only 10 classes.

We first freeze the feature extractor, which means we don’t train that part (during backpropagation, the weights of this section won’t be changed). Here is how we do it in PyTorch.
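A sketch of that freezing-and-replacing step with torchvision’s VGG16 (the optimizer choice and the preprocessing note are my assumptions):

```python
# Sketch: freeze VGG16's feature extractor and swap the final classifier layer.
import torchvision.models as models

vgg16 = models.vgg16(pretrained=True)          # trained on ImageNet (1000 classes)

for param in vgg16.features.parameters():      # freeze the feature extractor
    param.requires_grad = False

# replace the final 1000-way layer with a 10-way layer for MNIST
vgg16.classifier[6] = nn.Linear(vgg16.classifier[6].in_features, 10)

# only the still-trainable parameters go to the optimizer
optimizer = t.optim.RMSprop(
    [p for p in vgg16.parameters() if p.requires_grad], lr=1e-4)

# Note: VGG16 expects 3-channel 224x224 inputs, so the MNIST transform would need
# something like Resize(224) + Grayscale(num_output_channels=3) (my assumption).
```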

Now let’s train the model

It takes a lot of time as it’s a big model, but the training and testing accuracies improved.

Of course the dataset is very simple and clear.

Losses and Accuracies

Let’s visualize the first conv layer weights of VGG16 pretrained on ImageNet.

It seems like each filter contains a certain mix of colors and looks for a particular color pattern in the inputs.

Let’s pass an MNIST image to this layer.

Now let’s pass this image through all the layers in the feature extractor. Here is how it looks (I plotted only 64 feature maps for each layer).

VGG16 Feature Extractor.

As the image size gets reduced, it gets hard to visualize the later layers.

Anyhow, transfer learning is a really powerful technique for many complicated problems in artificial intelligence with deep learning (I will discuss it in depth in later stories).

Enough! I’m leaving.

Don’t go! There is still some stuff to experiment with.

So far we have focused on the classification task, which belongs to the category of “discriminative” models.

Given inputs, we want to build a model that can classify the inputs into the corresponding targets as correctly as possible.

It learns the conditional probability distribution P(Y|X).

Let’s also experiment with “generative” models, which learn the joint probability distribution P(X, Y).

Ex: Generative adversarial networks (GANs).

Yeah, let’s train a GAN to learn the MNIST data distribution (I chose a conditional GAN so it can learn the class-wise distributions).

If you are new to GANs, check out my other story, GANs with Math.

I took simple networks for the generator and the discriminator.

Generator
Discriminator
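The post only shows images of these networks, so here is a hedged sketch of what simple conditional-GAN networks could look like (all layer sizes, the label embedding, and the 100-dim noise vector are my assumptions):

```python
# Sketch of simple conditional-GAN networks.
import torch as t
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, n_classes=10):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 28 * 28), nn.Sigmoid(),   # pixels in [0, 1], like the data
        )

    def forward(self, z, labels):
        x = t.cat([z, self.label_emb(labels)], dim=1)
        return self.net(x).view(-1, 1, 28, 28)

class Discriminator(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(28 * 28 + n_classes, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),         # probability of "real"
        )

    def forward(self, img, labels):
        x = t.cat([img.view(img.size(0), -1), self.label_emb(labels)], dim=1)
        return self.net(x)
```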

The functions below calculate the discriminator’s loss (fake + real) and the generator’s loss (fake).
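A sketch of both loss functions using binary cross-entropy, built on the generator/discriminator sketch above (the function names and signatures are my assumptions):

```python
# Sketch: conditional-GAN losses with BCE.
bce = nn.BCELoss()

def discriminator_loss(D, G, real_imgs, labels, z):
    real_target = t.ones(real_imgs.size(0), 1)      # 1 = real
    fake_target = t.zeros(real_imgs.size(0), 1)     # 0 = fake
    real_loss = bce(D(real_imgs, labels), real_target)
    fake_imgs = G(z, labels).detach()               # don't backprop into G here
    fake_loss = bce(D(fake_imgs, labels), fake_target)
    return real_loss + fake_loss

def generator_loss(D, G, labels, z):
    real_target = t.ones(z.size(0), 1)              # G wants D to call its fakes "real"
    return bce(D(G(z, labels), labels), real_target)
```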

Let’s train the GAN for 200 epochs and keep track of the generator’s outputs at every epoch.

GAN Training

Final outputs from CGAN are

Random classes
Class wise images

This is how the CGAN generates the MNIST images from epoch 1 to 200.

Here is the final GAN data distribution compared to original distribution.

Left: 100 real samples. Right: 100 generated samples.

As you can see, it’s not able to generate a diverse set of images for each class; essentially only one image per class has been generated.

We could try using batch norm, dropout, or different kinds of architectures to improve the GAN, but for the scope of this story I leave it like this.

GAN models are really difficult to train; many times we encounter problems like “mode collapse”.

For example, below is the result of mode collapse (I tried some architectures and they collapsed).

The generator gets stuck in some local region and keeps generating the same image(s) to trick the discriminator.

Which is why training GANs is really tricky, but the output is really worth it.

Well, that’s all for this story. I will come up with a new dataset for the next story.

Here are the things that have been discussed.

MNIST dataset understanding, PCA and t-SNE, model building, optimizer comparison, trained weight and gradient understanding, hidden layer visualization, transfer learning, and GANs on MNIST.

Just wanna say “Bravo/Brava” for making it to the end, and feel free to leave your thoughts/comments/suggestions/critiques.

The full code is available on my Github.

In case you want to connect with me, here are my Twitter and LinkedIn profiles.

Have a good time learning/playing with Deep learning.
