Digit Captioning with LSTMs.

Diego Aguado
5 min read · Jan 4, 2019


You’ll only need Keras, NumPy and OpenCV. In this post I’m going to show you what can be built with out-of-the-box Python libraries for a deep-learning project.

Data

MNIST is a widely known and used dataset: images of handwritten digits. This dataset, along with others like CIFAR, is often used as a benchmark for testing different kinds of proposals in deep learning, such as new architectures.
It’s so widely used that you can download it via Keras in one line.
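That one line looks like this (the first call downloads the dataset; subsequent calls use the local cache):

```python
from keras.datasets import mnist

# Returns uint8 arrays: 60,000 training and 10,000 test images of 28x28 digits.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```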

With a bit of creativity (which may be surprisingly important in deep learning) and the proper libraries we can create an entirely different dataset out of MNIST. Let’s sample 9 random observations from the training data and concatenate them.
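The original snippet isn't shown here, but the sampling-and-concatenation step can be sketched as follows (`make_sequence` is a hypothetical helper name):

```python
import numpy as np

def make_sequence(images, labels, seq_len=9, rng=None):
    """Sample `seq_len` random digits and stitch them side by side."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(images), size=seq_len, replace=False)
    # images[idx] has shape (seq_len, 28, 28); hstack yields (28, 28 * seq_len).
    return np.hstack(images[idx]), labels[idx]

# With MNIST loaded via Keras, e.g.:
# seq_img, seq_lab = make_sequence(x_train, y_train)
# seq_img is one wide image; seq_lab is its 9-digit "caption".
```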

You probably guessed what the previous code does, but just to make sure, here’s a visualization of it.

Randomly generated sequence using MNIST

So now we know how to generate images of sequences of random MNIST observations. This might not seem that impressive, but the next thing to try should spark your curiosity.

Modeling

With a dataset that consists of images and associated text, your deep-learning creativity should start flying and whispering… “create a model”. Of course, it should handle these two types of data with the proper structures. Let’s take a look at what it could look like.

Conv + LSTM model

(Disclaimer: this model is just for illustrative purposes; a working, proven model can be found in the GitHub repo provided.)

After some training with proper training and validation sets, we get a model that captions the numbers in images with high accuracy (~96%). Here’s what the model captions on unseen images.

Fully correct predicted sequence
One mistake in predicted sequence

This model has two core types of layers: 2D convolutions and LSTMs. An argument could be made that these two are among the most important layers in current deep-learning architectures for solving most computer vision and NLP problems.

Tips on how to combine Convolutional and LSTM layers.

Some important details to pay attention to in this model involve getting the tensor dimensions right, knowing how to reshape them, and handling the sequences with LSTMs.

1. It’s crucial to properly reshape the tensor resulting from the convolutions so that it can be used by the LSTM. Note how we set the number of filters in the last convolution to match the sequence length. Once we’ve done that, we reshape the resulting tensor into one that represents a sequence, so the LSTM layer can consume it.

2. Also notice that we set the parameter `return_sequences=True` on both LSTM layers: in the first because we want to stack another sequence-handling layer on top of it, and in the second because we want the full sequence as the prediction.

3. Finally, we use the standard way to get class predictions: a softmax layer. It’s crucial to handle tensor shapes throughout the model, since the softmax implementation in Keras only admits 2D and 3D tensors, ignoring the batch dimension.
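The three tips above can be sketched in Keras as follows. This is an illustrative sketch, not the repo's exact architecture: the filter counts, LSTM sizes, and the `Permute` step (used here so each convolution filter becomes one timestep) are assumptions.

```python
from keras.models import Sequential
from keras.layers import (Conv2D, MaxPooling2D, Permute, Reshape,
                          LSTM, Dense, TimeDistributed)

SEQ_LEN, N_CLASSES = 9, 10  # 9 digits per image, 10 classes per digit

model = Sequential([
    # Input: one wide image of 9 concatenated 28x28 digits.
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28 * SEQ_LEN, 1)),
    MaxPooling2D((2, 2)),
    # Tip 1: the last convolution's filter count matches the sequence length...
    Conv2D(SEQ_LEN, (3, 3), activation='relu'),
    # ...move the channel axis first, then flatten the rest, so the tensor
    # becomes a (timesteps, features) sequence the LSTM can consume.
    Permute((3, 1, 2)),
    Reshape((SEQ_LEN, -1)),
    # Tip 2: return_sequences=True on both LSTMs.
    LSTM(128, return_sequences=True),
    LSTM(128, return_sequences=True),
    # Tip 3: a per-timestep softmax gives one class distribution per digit.
    TimeDistributed(Dense(N_CLASSES, activation='softmax')),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```

The output has shape `(batch, 9, 10)`: one softmax distribution per digit in the sequence.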

As we can see, the model still has room for improvement, and it would be easy to improve it by continuing to train while monitoring the validation loss to avoid overfitting.

(Trying to) Get real

Being realistic, this dataset is easy to model and, potentially, we could reach ~100% accuracy with a bigger model and more training. Nevertheless, data in real-world problems will most likely be dirtier and more complex. If you are dealing with photos as data, you could find blurry ones or other kinds of noise. We can mimic that using OpenCV. Consider a new, stochastic way of generating data to train the model.

Example of blurred generated sequence

OpenCV has several ways to blur an image; in this example we consider just one.

With this new way of generating sequences we train another model on “more realistic” data. After training for 13 epochs we obtain a model with ~94% accuracy on the non-blurred validation set and ~91% on the blurred one. The model keeps yielding high performance despite the way in which the data was corrupted. This is a good sign of model robustness.

Prediction on blurred generated sequence

Why is this useful?

This is clearly a toy project to have fun with. Even so, it’s pretty nice practice for honing your programming, computer vision and deep learning (even creativity) skills, especially considering that the dataset and tools used are open source. But there is more to it.

It might not be intuitive at first, but with the previous example you can train a complex neural network that detects edges and shapes in sequences. With more data and training time you could use this model as a base for transfer learning to build a model that captions custom data. This exercise also works as a benchmark if you want to train a similar model from scratch on your own data, considering it’s a fairly small model.

Last thoughts on the experiment

There are many ways to exploit open-source datasets and tools. It’s remarkable how far the democratization of technology, and especially of deep learning, has come, and how it keeps opening up further every day. This is just an example of what can be done with a few lines of code and some creativity.

As always, here’s the repo for you to clone, fork or comment on, because <3 open source.


Diego Aguado

On Machine Learning, Artificial Intelligence and Neuroscience. Github, Twitter: @DiegoAgher.