Solving Sudoku: Part IV

Nesh Patel
17 min read · Apr 7, 2020


The time you’ve no doubt been waiting for has finally arrived. I realise it has been more than 2 years, and I’ve horribly neglected my responsibilities to anyone who was following this series: I can only offer my apologies. In part, this fell off my radar because I found myself quite busy after conducting the job interview I talked about in part 1. I decided to pick it up again as I suddenly found myself with some free time and a lot of time indoors due to a gigantic international health crisis you might have heard of.

In this, the final chapter of Bernard’s epic journey, I will share the full code base and explain the final methods for improving the accuracy of our plucky Sudoku solver.

For those of you with social lives for whom the last paragraph had no context, I recommend having a look at the previous entries in this series.

If you can read all of those without vomiting, panicking or escaping to join a cult, you are probably the type of person that can tolerate reading to the end of this article.

In the previous blogs I have been reluctant to give access to the full project, as I suspected it would be overkill and hinder learning of the individual techniques. Since we’re now approaching our destination, I’m happy to commit to the big reveal. This repository contains all of the up-to-date code for the vision algorithms and the neural network explained in this post. There are also helper functions to assist with classification, augmentation and model training which will be useful when going through this blog.

Improving on Mediocrity

Currently Bernard is able to identify a portion of our test images, getting around 58% right. This isn’t particularly good though, and there are a few easy things we can do to help him along with detection. Our original algorithm was pretty basic, which means that for some grids we’re losing valuable information by the time we get to attempting digit recognition. I’m not going to go into each of these in too much detail, but will indicate where these heuristics have been implemented in the code:

  • Estimating the width of the border of the outer grid and cropping that out as well, improving fidelity of digit images. [Code]
  • Auxiliary grid detection algorithm for certain cases when using contours is ineffective. [Code]
  • Using Hough lines and linear algebra to better estimate the location of the grid lines in the image. [Code]
  • When an impossible digit is predicted for a given cell, use the next most likely prediction. For example if two 3s were predicted on the same row, find the next most likely prediction for those two cells and use that instead. [Code]
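
As a rough illustration of that last heuristic (a sketch rather than the repository’s exact code), suppose cell_probs holds the network’s softmax output for every cell as an 81x10 array; when a predicted digit breaks the Sudoku constraints we simply walk down that cell’s probabilities to the next candidate that hasn’t been ruled out:

import numpy as np

def next_best_guess(cell_probs, cell_idx, rejected):
    """Return the most likely digit for a cell, skipping digits already ruled out."""
    order = np.argsort(cell_probs[cell_idx])[::-1]  # candidate digits, most likely first
    for digit in order:
        if digit not in rejected:
            return int(digit)
    return 0  # fall back to treating the cell as blank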

Those are all quite simple changes that should bring up the accuracy with our digit recognition method, but the obvious thing we can do is train the model on a larger data set. Seeing a wider range of examples should help it learn to deal with a greater variation in images. The archive of Sudoku images is now up to 129, which gives us a fairly meaty set to work with. In the repository you will find an archive of all the test images alongside descriptor files which have been tediously collated for Bernard’s benefit.

We can use this bank of images to better train Bernard in image recognition. First we should split the images into test images and training images. The digits from the training images will be shown to Bernard when teaching him recognition. However, if we measured his success on images we’ve already told him the answers to, one might rightly accuse us of cheating. So we keep a test set of images that we haven’t used in training to see how much he’s learned. If we were being really strict we could use a separate validation set to avoid any bias in deciding when to stop training the model, but it’s not that important here. For training the models in this project, I split the archive into 80 training images and 50 test images, admittedly somewhat arbitrarily.

We can increase the amount of training data further by artificially augmenting and perturbing the existing images we have. We can blur, crop, skew, shift and pixelate them at random to increase the variation of our training set, so that the model becomes more robust to unseen images.
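
A minimal sketch of this kind of augmentation (the repository has its own, more thorough helpers) might blur, shift and pixelate a grayscale digit image at random using OpenCV:

import random
import cv2
import numpy as np

def augment(img):
    h, w = img.shape
    if random.random() < 0.5:  # random blur
        img = cv2.GaussianBlur(img, (3, 3), 0)
    if random.random() < 0.5:  # random shift of up to 2 pixels in each direction
        m = np.float32([[1, 0, random.randint(-2, 2)], [0, 1, random.randint(-2, 2)]])
        img = cv2.warpAffine(img, m, (w, h))
    if random.random() < 0.5:  # pixelate by downsampling then upsampling
        img = cv2.resize(img, (w // 2, h // 2))
        img = cv2.resize(img, (w, h), interpolation=cv2.INTER_NEAREST)
    return img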

After all that effort, with all this data and our improved vision algorithm, Bernard was able to recognise 98.1% of the 2094 test digits. Woohoo right?

Wrong. This sounds quite impressive, but when we test recognition across all 130 boards, we only get 73.2% accuracy. This is way lower than 98% and isn’t much better than our original effort, which solved 58% of boards. This is because we are training Bernard to recognise digits individually, but all with a view to being able to solve Sudoku boards. Given that a Sudoku board has 81 cells and solving it requires every single cell prediction to be correct, any per-digit error is vastly amplified.
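
A rough back-of-the-envelope calculation shows the effect: if every given digit were recognised independently with probability p, a board with n clues would come out fully correct with probability about p^n. Taking p = 0.981 and a fairly typical n of around 30 clues:

$$P(\text{board correct}) \approx p^{n} = 0.981^{30} \approx 0.56$$

In practice the errors are not independent and boards vary in how many digits they contain, so the real numbers differ, but the point stands: a per-digit accuracy that sounds excellent can still sink a large fraction of boards.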

When we look a bit closer at the failures, we can immediately spot another problem. Below is a selection of examples of some of the digits that were incorrectly recognised.

Figure 1: Sample of erroneously classified digits using the original technique.

The middle column shows what the digits look like after the parsing algorithm has been applied, and it’s plain to see that in these cases we’re losing information that is required to recognise the digit. Even humans can’t correctly classify all the images from the middle column. We could always continue to tweak the extraction algorithm but it is likely that there will always be some situation where we lose too much information and actually make it more challenging for the model instead of less.

The solution is quite simple: don’t process the digits at all. If humans can classify the images on the left, we ought to be able to teach Bernard to do the same. So very quickly I prepared sets of images without the processing, fed them to the original neural network and let it learn. Given how easily Bernard grasped the recognition of digits before, surely it’s only a matter of time before he picks up this new style of image?

Step 0
Test accuracy 0.0404938
Step 100
Test accuracy 0.649877
...
Step 800
Test accuracy 0.649877
Step 900
Test accuracy 0.64987

Sadly I was to be crushingly disappointed. The simple network architecture was excellent at learning the processed binary images, but now the far more complex and varied images are completely impenetrable to it. Regardless of how long we let it train for, it never makes any improvements on its decisions, always randomly fluctuating around an extremely high failure rate. In fact it is so bad at recognition that it fails to fully recognise even a single grid. It seems we’re going to have to get a bit more creative.

Copying Nature, Again

Last time, when we were faced with a difficult problem we looked to nature for inspiration, and this new problem is no exception. During the ’50s and ’60s, a very smart man and another very smart man conducted research inserting electrodes into monkey and cat brains. Their findings were quite brilliant: they discovered that the visual cortex of animals contains neurons that fire in response to only certain parts of the field of view. This may not sound interesting at first, but it appears to be quite critical in allowing the complex perception of images and visuals of which animals are capable. At the very least, some folks in Sweden thought it worthy of giving out one of their little baubles.

So in the spirit of further plagiarising Mother Nature, Yann LeCun applied this principle to neural networks in the late ’80s, ultimately coming up with a design called the convolutional neural network; his particular digit-recognising model was nicknamed LeNet.

How Does Photoshop Work?

The basis of this method is an implementation of a mathematical concept, convolutional image filtering: a neat matrix operation with unexpectedly cool results. This involves convolving a kernel, a small matrix, with a matrix representation of the actual image. I concede this makes no sense yet and has only served to make you think of popcorn, but this animation should help to demonstrate what’s happening:

Figure 2: Filter kernel, image matrix and the output after convolution is performed.
Figure 3: Animation demonstrating the convolution operator. Shows a 3x3 filter on a 5x5 image with 1 pixel of zero-padding and a stride of 1.

Clear as mud. This is what’s happening:

  • Pad the image with zeroes; this ensures that the output image is the same size as the original one. Note that there are other methods to account for this in image processing, but zero padding is common when building neural networks.
  • Place the kernel in the top left of the image and convolve it with the pixels underneath. This means multiplying each kernel value with the pixel it covers and summing the results over the entire kernel. Note how this differs from matrix multiplication.
  • Move the kernel 1 pixel at a time; this distance is called the stride.
  • Convolve and repeat until the entire image has been covered.
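
Those steps translate almost directly into code. Here is a bare-bones NumPy version (a sketch, assuming a single-channel image and a square, odd-sized kernel; like most deep learning libraries it doesn’t flip the kernel, so strictly speaking it computes a cross-correlation):

import numpy as np

def convolve(image, kernel):
    k = kernel.shape[0]
    padded = np.pad(image, k // 2, mode='constant')   # step 1: zero-padding
    out = np.zeros(image.shape)
    for row in range(image.shape[0]):                 # steps 2-4: slide the kernel one pixel at a time
        for col in range(image.shape[1]):
            patch = padded[row:row + k, col:col + k]
            out[row, col] = np.sum(patch * kernel)    # multiply element-wise and sum
    return out

# Try it with a common edge-detection kernel on a random 5x5 image.
edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
print(convolve(np.random.randint(0, 255, (5, 5)), edge))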

You are probably wondering what the point of all that was, as we appear to have a completely garbled matrix at the end of it. “How on Earth can that be remotely useful” I hear you cry. Depending on the kernel you choose, this operation has pretty profound effects when applied to images:

Figure 4: Various kernels with their effect on an image when used as a convolutional filter. Sourced from Wikipedia

As it turns out, we can achieve blurs, edge detection and much more with this simple operation: enabling millions of people around the world to prettify their cat pictures before uploading them to Instagram.

Back to the Brain

The clever bit that Yann LeCun figured out is that we can use this technique in the learning process for image classification neural networks. Instead of getting the network to learn weights between interconnected neurons, we can restructure the network to learn the values for the kernels themselves. Admittedly, I’ve been working on this for a few days now and it is still far from clear to me why this should help with the classification process, but stick with me: the payoff is worth it.

An important concern when building a network of this kind is determining how many features (another name for kernels in this context) to train. It is typical to learn multiple kernels for each convolutional layer; the number of features trained is usually called the layer depth, and the collection of features is called the feature map. This increases training time but also allows greater complexity in the network, potentially improving its classification capability. If this sounds daunting, it is, but I will show some examples later to clear things up.

ReLU Beats Sigmoid

Last time we wrapped the neuron activation in a sigmoid function to normalise the result and encourage learning to be fastest when the activation is lowest. The gradient of the sigmoid shrinks substantially as the activation grows, and this effect is compounded in complex, deep networks with lots of layers. Ultimately it leads to a slow learning process and a higher likelihood of stagnation. Instead, it has become popular to use a rectified linear unit (ReLU) function:

Equation 1: ReLU function, f(x) = max(0, x): zero when x is below 0, otherwise the identity function.
Figure 5: A comparison between the sigmoid function used in the simple network and the ReLU function often used with convolutional nets (source).
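
To put a number on that: the sigmoid’s gradient never exceeds 1/4, so back-propagating through many sigmoid layers multiplies several small factors together, whereas ReLU passes the gradient through unchanged for any positive activation:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4}, \qquad \mathrm{ReLU}'(x) = 1 \;\text{ for } x > 0$$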

When applied to images, the ReLU function will neutralise the dark colours of the image. In our neural net, the ReLU function will be applied to each of the feature maps in the convolutional layer. If that sentence sounds like gibberish, perhaps a picture will demonstrate it better:

Figure 6: ReLU function applied to a grayscale of René Magritte’s The Son of Man.

Reducing the Image

The next ingenious trick we are going to use is pooling. The most common type of pooling is max-pooling, which is a filter that takes the largest pixel value in the coverage area. The result of this downsampling is that it reduces the size of the image whilst maintaining the recognition of features in the image. The rationale is that if a feature is detected in the original map it will have a high activation, which means it will effectively be “recognised” even if we ignore some of its neighbouring pixels. It will, however, reduce the importance of the location of this feature relative to other pixels in the image. This is really good for image detection as it allows us to pick up on features regardless of their position in the original image, ultimately reducing overfitting. Furthermore, by making the images smaller we reduce the amount of computation that needs to be performed.

Figure 7: Example demonstrating 2x2 max pooling operation on a 4x4 image.
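
In code, 2x2 max-pooling with a stride of 2 is just a matter of taking the maximum of each 2x2 block (a quick NumPy sketch, not necessarily the same numbers as the figure; even image dimensions are assumed):

import numpy as np

def max_pool(image):
    h, w = image.shape
    pooled = np.zeros((h // 2, w // 2), dtype=image.dtype)
    for row in range(0, h, 2):
        for col in range(0, w, 2):
            pooled[row // 2, col // 2] = image[row:row + 2, col:col + 2].max()
    return pooled

print(max_pool(np.array([[1, 3, 2, 1],
                         [4, 2, 1, 0],
                         [5, 1, 0, 2],
                         [2, 3, 4, 8]])))  # -> [[4 2]
                                           #     [5 8]]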

Dropping Out, Just Not Out Of School

Convolutional neural networks are really ridiculously powerful. Too powerful in fact. They have such freedom and complexity that they can adapt to almost any data they’re given, particularly when constrained to images. Whilst this sounds handy for us, and it is, it leads to problems of overfitting. This is when the model learns the training data too closely and isn’t able to apply what it’s learned to other, slightly different, situations. Think of it a bit like rote learning at school, where kids can learn to recite a spectacularly abundant cornucopia of useless information but are unable to solve a problem when it is structured differently, or means applying some of their techniques laterally. Careful how much you stretch this analogy, because human children are broadly speaking the best and most adaptive generalised thinkers in the known universe, but you get the idea.

Figure 8: Example demonstrating overfitting. The green line represents an overfitted model which, whilst perfect on this data set, is unlikely to be successful with unseen data. The black line shows a more generalised model. Sourced from Wikipedia.

One of the ways we can deal with this problem is by implementing dropout. This is a technique where neurons are temporarily removed from the network layers during training, with a fresh random selection made for each sample. Then at the end, when we come to use our network for testing, we use the whole network, but scale the weights for each neuron in accordance with the probability that it was kept during training. By thinning out the layers, the network is forced to build different relationships than it would with a dense network always available, which usually makes it more robust to unseen problems.

Figure 9: Left: Standard neural network with 2 hidden layers. Right: Example of a thinned network after applying randomised dropout (source).
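
A toy illustration of the idea on a single layer’s activations (the description above in miniature; note that TensorFlow’s tf.nn.dropout instead scales the surviving activations up by 1/keep_prob during training, which amounts to the same thing):

import numpy as np

def dropout_train(activations, p_keep=0.5):
    mask = np.random.rand(*activations.shape) < p_keep  # choose survivors at random
    return activations * mask                           # dropped neurons output zero

def dropout_test(activations, p_keep=0.5):
    return activations * p_keep   # use every neuron, scaled by its keep probability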

When Oscar Wilde said that everything is about sex aside from sex, he was acutely unaware of how prophetic his words would prove in the field of computational mathematics. Part of the motivation for applying dropout to improve the resilience of a neural net came from the idea that evolution generally favoured sexual reproduction over asexual reproduction. Presumably it didn’t do this because one is obviously a great deal more fun than the other; rather, it must have been more effective at gene propagation. It seems somewhat intuitive that asexual reproduction could be more effective, perfectly passing down fantastic genes to permanently immaculate and increasingly annoying offspring. Especially when juxtaposed against the continually disappointing and underachieving offspring produced by randomly mutated sexual reproduction. However the evidence suggests that, perhaps over a long period of time, the criterion for natural selection isn’t the individual fitness of the organism as a whole, but the ability of the genes to be effective regardless of their compatibility with other genes. Given that a gene cannot guarantee its partners, the ability for it to be successful with a random set of partners determines its longevity.

Wrapping Up

The next layers of a convolutional neural network are fully connected layers, with each neuron from the feature map connecting to every neuron in the next layer with a weight and a bias. This is just like the original network described in the last post. It is typical to have at least one densely connected layer with a lot of neurons in order to capture as much non-linearity in the feature maps as possible. The final layer is a classification or completion layer, which has a number of neurons equal to the possible output classifications.

We now have a vastly complex set of weights and biases relating between various filters of the image, inter-connectivity and eventually classification. The values for all of these are calculated using the same back-propagation technique previously described. The model is regularised using dropout to help prevent over-fitting.

If you are struggling to make sense of all this, fret not, for Adam W. Harley has truly gone above and beyond and produced a free, interactive visualisation illustrating the whole thing. Draw a digit in the top left and you will see the result of the pre-trained filters and weights and biases in action. I can’t stress enough how awesome this is and I encourage the intrepid amongst you to read his accompanying paper on the topic.

Figure 10: Visualisation of a convolutional neural network for handwritten digit recognition (source by A. W. Harley).

Teaching Bernard

It should come as sweet relief to you to hear that we are near the end of the tedious lecturing part of this blog and will get started with the tedious coding. You might suspect that this sounds awfully complicated and there’s a hell of a lot of difficult maths to perform and a lot of difficult code to implement. Luckily for us, Tensorflow has come to the rescue and will do the heavy lifting for us. This is completely fine as I’ve shown that when you show off the final product, people will believe that you’ve both read and understood how it works, no matter how laughably untrue that is.

Source

First we declare a couple of placeholders: x being our input layer of 784 neurons for the 28x28 pixel image, and y being the 10 neurons for each possible classification, 0 to 9. Next comes our first convolutional layer, which will have 32 features, each formed using a different 5x5 convolutional filter. That is 32 different image filters we’re training to form the first layer of the network.
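
In TensorFlow 1.x terms, that boils down to something like the sketch below (x and y are named above; the other names, such as w_conv1 and b_conv1, are my own guesses):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 784])   # flattened 28x28 pixel images
y = tf.placeholder(tf.float32, shape=[None, 10])    # one-hot labels for the 10 classes

x_image = tf.reshape(x, [-1, 28, 28, 1])            # back to 2D (single channel) for convolution

w_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))  # 32 different 5x5 filters
b_conv1 = tf.Variable(tf.constant(0.1, shape=[32]))                    # one bias per feature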

Source

The conv2d function declares a convolutional operator with a stride of 1 and the zero-padding calculated dynamically so that the output is the same size as the input. max_pool_2x2 declares a basic 2x2 max pooling function with a stride of 2, which will reduce the size of the image in the same way as shown in Figure 7. h_conv1 is the first convolutional layer, formed by convolving the input image with the weight tensor (which has a depth of 32), adding 32 biases and then applying the ReLU function. That one line has a deceptively large amount of maths involved in it. h_pool1 is the first max-pooling layer, downsampling the image to 14x14 pixels.
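
Roughly speaking, that corresponds to the following (again a sketch rather than the repository verbatim):

def conv2d(x, w):
    # stride of 1 in every direction; 'SAME' zero-pads so the output matches the input size
    return tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    # 2x2 max pooling with a stride of 2, halving each image dimension
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

h_conv1 = tf.nn.relu(conv2d(x_image, w_conv1) + b_conv1)  # convolve, add biases, apply ReLU: 28x28x32
h_pool1 = max_pool_2x2(h_conv1)                           # downsampled to 14x14x32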

Source

To improve our model, we add a second convolutional layer, this time with 64 5x5 filters. The third dimension of the w_conv2 tensor is 32 because that is the number of features we have from the previous layer. So all of the features from our first layer are connected to the second. Once again max pooling is applied, downsampling our image to 7x7 pixels.
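
In the same sketchy terms:

w_conv2 = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1))  # 64 filters, each spanning all 32 features
b_conv2 = tf.Variable(tf.constant(0.1, shape=[64]))

h_conv2 = tf.nn.relu(conv2d(h_pool1, w_conv2) + b_conv2)  # second convolutional layer: 14x14x64
h_pool2 = max_pool_2x2(h_conv2)                           # downsampled to 7x7x64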

Source

Then we add a fully connected layer with 1024 neurons and no convolution. Its input dimensionality is 7x7x64: the number of feature maps from the previous layer multiplied by the number of pixels in the image after downsampling. Note that we are still using the ReLU function instead of sigmoid.
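
Something along these lines (w_fc1, b_fc1 and h_fc1 are my guesses at the names):

w_fc1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1024], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape=[1024]))

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])        # flatten the 64 downsampled feature maps
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, w_fc1) + b_fc1)  # fully connected layer of 1024 neurons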

This part implements the dropout. Note there is a placeholder for the probability of keeping each neuron.
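
In TensorFlow that is a couple of lines; conveniently, tf.nn.dropout also rescales the surviving activations by 1/keep_prob during training, so nothing extra is needed at test time:

keep_prob = tf.placeholder(tf.float32)         # probability of keeping each neuron
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)   # randomly zero out neurons while training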

The last part is a completion layer with 10 possible outcomes, one for each digit.
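
Roughly (y_conv is my name for the output logits):

w_fc2 = tf.Variable(tf.truncated_normal([1024, 10], stddev=0.1))
b_fc2 = tf.Variable(tf.constant(0.1, shape=[10]))

y_conv = tf.matmul(h_fc1_drop, w_fc2) + b_fc2   # one logit per possible digit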

Time For Training

Like before, we define a training function that minimises cross entropy and specifies an optimisation mechanism. This time we’re using Adaptive Moment Estimation (Adam) instead of gradient descent. I’ve already gone through a lot of theory here, and Adam isn’t conceptually different from gradient descent; it just has some optimisations that make it more computationally efficient and effective. With that in mind, here is the whole training function.
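
In rough TensorFlow 1.x terms it looks something like this sketch, where next_training_batch, test_x and test_y are stand-ins for the repository’s own data helpers:

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

correct = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10000):
        batch_x, batch_y = next_training_batch(50)  # a batch of 50 digit images and labels
        if step % 100 == 0:
            train_acc = accuracy.eval(feed_dict={x: batch_x, y: batch_y, keep_prob: 1.0})
            test_acc = accuracy.eval(feed_dict={x: test_x, y: test_y, keep_prob: 1.0})
            print('Step %d, training accuracy %g' % (step, train_acc))
            print('Test accuracy %g' % test_acc)
        # dropout is only applied while training, hence the keep_prob of 0.5 here
        train_step.run(feed_dict={x: batch_x, y: batch_y, keep_prob: 0.5})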

The code might be a lot to stomach, but if you go through it line by line, it shouldn’t be too different to what we had in the first example. The main thing to note is the keep_prob values we are using. We want to drop out nodes when training but include them all when testing; that’s the reason it switches between 0.5 and 1.0. Now all we need is a good book, a movie of your choice or some fresh paint you need to watch dry.

Step 0, training accuracy 0.02
Test accuracy 0.0459259
Step 100, training accuracy 0.66
Test accuracy 0.72642
Step 200, training accuracy 0.8
Test accuracy 0.804938
Step 300, training accuracy 0.92
Test accuracy 0.842469
Step 400, training accuracy 0.88
Test accuracy 0.871111
Step 500, training accuracy 0.94
Test accuracy 0.885185
Step 600, training accuracy 0.96
Test accuracy 0.900494
Step 700, training accuracy 0.98
Test accuracy 0.907654
Step 800, training accuracy 0.98
Test accuracy 0.914815
Step 900, training accuracy 0.98
Test accuracy 0.921975

It now looks like he is starting to get the idea after 1,000 iterations, but needs quite a bit more time to get those percentages up. I ran this for many thousands of iterations; it took a few hours to finish, and the laptop spent the whole time billowing steam and whirring like a jet engine. However, when I returned, Bernard had accomplished something quite remarkable. His accuracy for recognising digits in the test set was 99.8%. At the level of board recognition, he solved all but one (99%). This was far more accurate than anything I had suspected was possible from a cheap pre-fabbed algorithm, 2 weeks after I first started learning about the method, run on a standard laptop. If you don’t believe me, this model is the one published in the GitHub repo under data/best-model. I have reason to believe that it should be quite resilient to new, unseen images as well. This included solving some boards that I thought would be way out of scope for this effort because the image quality was so low:

Figure 11: Examples of solved images.

This is the End

And so that brings us to the end of Bernard’s Epic. I’m not sure whether it will be remembered as well and for as long as Homer’s, but only time will tell. What started as some coding practice and a way to sardonically mock perfectly innocent participants of a universally popular pastime, has culminated in the creation of a being fully capable of indulging in the mockery himself. If there’s nothing else you take away from this series, please take away the idea that this is actually pretty easy, and you can do it too. Bernard was born out of 2 weeks of boredom and access to Google. This really is an indication of how far this field has come and how accessible it is to everyone. I hope my puny contribution to the reams of tutorials and literature on this topic proves useful to at least some of you, and I implore you to go build your own brains to try to outwit Bernard. And if you’re not as sad as me, perhaps this bumbling foray into the world of artificial intelligence may give you an idea of what is possible in the world today and in the future.

P.S.

I apologise again for this final entry being so late (over 2 years since I started this project), I was fairly busy after the aforementioned interview, and this sort of fell off my radar. Someone on Reddit suggested it would be cool if this worked for real-time video, so please accept this feeble example of British-handicraft as a formal apology (code here):

Acknowledgements

Yann LeCun’s explanation of LeNet, without which none of this would have been possible. The implementation of his theory is basically indistinguishable from magic and for that he must be applauded.

Ujjwal Karn’s blog: An Intuitive Explanation of Convolutional Neural Networks. I found this to be a really good explanation that nicely tied together some very good resources.

Adam W. Harley’s paper: An Interactive Node-Link Visualization of Convolutional Neural Networks, and the hosted visualisation tool. This is just amazing at intuitively demonstrating what’s really going on in this complex process. For those interested, he has also made one for the basic deep network we made last time. Less helpful but perhaps prettier are the 3D versions that can be found on his site.

Adit Deshpande’s blog: A Beginner’s Guide To Understanding Convolutional Neural Networks, another well explained resource on the topic.

Nitish Srivastava et al. and their paper: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, for a thorough analysis and explanation of the dropout technique.

Diederik Kingma and Jimmy Lei Ba: Adam: A Method for Stochastic Optimization, which introduces and thoroughly explains this enhanced alternative to gradient descent optimisation.
