Solving Sudoku: Part III

Nesh Patel
15 min read · Nov 3, 2017


If you are not yet familiar with the renowned puzzle solver Bernard, I highly recommend familiarising yourself with the first and second chapters of his epic tale. In those episodes we got to a point where he could look at a photograph and understand where the grid is and where each of the digits sits. However, that's still not good enough: Bernard just sees a set of oddly shaped blobs, and at this point he doesn't even know that there are any significant differences between them. We need to teach him numbers, and to do that we must give him a brain.

Figure 1: Original image, with the extracted grid and our current, nonexistent, interpretation.

The task is pretty daunting compared to our previous efforts. Looking at the image above, it is very clear to a human what the numbers should be, but even within this one image there are subtle differences between digits of the same type. Edges may be slightly rougher, and segments of the numbers may be slightly cut off. And this is just a single image; imagine trying this on other photos. Our work on the extraction method has compensated for size and contrast quite well, but there will be images with different fonts, blurriness and a seemingly endless host of other subtle variations humans would never notice when looking at the photo. It's true that we could use relatively simple comparison-based techniques to identify this board, but we should be far more ambitious and try to build a mind of which Mary Shelley would be proud.

When in doubt, copy nature. When thoroughly expert in a subject, copy nature better.

— Any engineer, ever

That’s a motto I thoroughly believe in, particularly with respect to products of evolution. Various non-evolutionary systems in nature are still better than anything humans can conceive, like stars producing energy. The products of evolution perhaps surpass even this, producing the world’s most efficient mechanisms for pretty much anything you can think of. This is why the robots in the Matrix wanted to use humans as efficient batteries. Perhaps though, those robots were foolish and instead of using humans as batteries, should have considered using us as processors.

The brain is the most complex thing in the universe.

— Anyone who has studied the brain, ever

It’s pretty well recognised that the human brain is the most sophisticated system in the currently known universe. It is capable of an incredible range of thought, creativity, wonder and of course, stupidity.

Understanding The Most Complex Thing In The Universe

Inside the brain are neurons, cellular structures responsible for electrically communicating information, and generally the more complex the brain, the more neurons it contains. I say generally, because the African elephant has approximately 3 times as many neurons as a human, yet has won rather fewer than 3 times as many Nobel prizes. Very simply, a neuron takes synaptic input and produces a synaptic output, analogous to a function in computing. These neurons are then interconnected to form a hugely complex network.

Figure 2: Neural network map of a mouse brain. Image from Allen Institute for Brain Science.

Above is a map of a mouse brain, with around 71 million neurons interconnected. The human brain has approximately 86 billion neurons, all different, all interconnected. Cue stunned silence and gasping.

Building Frankenstein’s Monster

If we thought what we were doing before was daunting, surely now people are running for the door. But fret not, we are going to start simple. Bernard isn't going to need a brain powerful enough to paint a masterpiece on the ceiling of the Sistine Chapel; he just needs to learn the digits 1–9. As such we don't need a proper brain, just a crappy, loose approximation. Furthermore, people vastly more intelligent than I theorised how to approach the mathematics of interconnected neurons way back in the 1940s, and the resulting kind of computer program became known as a neural network. Since then, advancements have been made consistently in the field and we're now at a point where we can make fairly impressive stuff for ourselves using our laptops.

A neural network program consists of layers of neurons, where each neuron is an object that holds a number between 0 and 1 called its activation. The output activations of each layer serve as part of the input for the neurons in the next layer, and so on until we reach the output.

Figure 3: Structure of a standard deep neural network.

In the image above, the first layer of neurons is the input layer, providing the stimulus for the network. The final layer is the output layer, with a neuron for each possible outcome that can be achieved. The hidden layers are where the intelligence lies. The theory is that these hidden layers will pick up on different features in the input and the combination of these features will indicate the right output. For example, the number 9 could be represented as a loop near the top and a vertical line and then a loop could be represented by 4 curves all placed adjacently. The idea is then that different stimuli in the input layer will cause different activations in each subsequent layer and eventually activate the correct neuron in the output layer.

Each connection between the neurons has a weight associated with it, which is a number that can be positive or negative, set arbitrarily to begin with. The activation of a neuron is a weighted sum of all of the activations of connected neurons in the previous layer. We also want a threshold below which the neuron is inactive; we can achieve this by subtracting a term which we call the bias. The activation of each neuron is then part of the input for each neuron in the next layer. This passing of information sequentially through the layers is called feedforward.

Equation 1: z = w₁a₁ + w₂a₂ + … + wₙaₙ − b, where w is the weight and a is the activation for a neuron in the previous layer. b is the bias for the activation.
Figure 4: Weights chosen specifically to pick out a vertical line. Green represents positive weights, red negative; all other weights are 0.

The image above shows an example of what the weights of a neuron in the second layer could look like if it were specially configured to find a vertical line in the centre of the image, accepting its input from the first layer. If we take the weighted sum of all of the activations, it will be highest when there are white pixels in that area with black pixels on either side. You can see then how the weighted sum would be lower for a 2 than it is for a 1. In reality we won't be setting these weights or biases ourselves; the network will learn the values itself, and the results won't be anywhere near as neat and tidy as what is shown above.

At the moment this weighted sum can be any number, but it is typical to apply a function to it that constrains the neuron's activation to between 0 and 1.

Figure 5: Plot and formula of the sigmoid function, σ(z) = 1 / (1 + e⁻ᶻ).

This activation function changes the way the neuron behaves and, in particular, how it learns: for example, the sigmoid function means neurons learn slowest when their activations are near the extremes of 0 or 1, where the curve flattens out. This gives our activation function as:

Equation 2: aᵢ = σ(Σⱼ wᵢⱼaⱼ − bᵢ), the activation function of a neuron i, where j indicates the connected neurons of the previous layer.
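As a concrete illustration, Equations 1 and 2 fit in a few lines of plain Python. The function names here are purely illustrative, not part of any library:

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def activation(weights, prev_activations, bias):
    # Equation 2: the weighted sum of the previous layer's activations,
    # minus the bias threshold, passed through the sigmoid
    z = sum(w * a for w, a in zip(weights, prev_activations)) - bias
    return sigmoid(z)
```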

For our example of digit recognition, our input is a 28x28 square grayscale image. We can model that as a layer of 784 (28 x 28) neurons, where each pixel intensity (normalised between 0 and 1) is the activation. The output layer should then have 10 neurons, one for each classification 1–9 and an extra one for 0 (blank). In between we will use one hidden layer with 16 neurons, an arbitrary choice but sufficient for the example we are trying to solve.

Figure 6: Diagram of the neural network we’ll build for digit classification.

Even in this small and simple network we have 810 (784 + 16 + 10) neurons with 12704 ((784 * 16) + (16 * 10)) weights and 26 (16 + 10) biases. Imagine a huge machine with 12730 knobs and dials on it, where tweaking each knob even slightly might completely change the result in ways you can't immediately predict, and we have something approximating our network. Trying to adjust these manually would be almost as torturous as completing the Sudoku yourself. Thankfully, like the brain, this network structure supports the ability to learn.

How Does It Learn?

When teaching people, it is often helpful to show examples of the correct solutions, and the same applies here. Ultimately we will show the network a set of pre-classified images, essentially saying “this is a 1, this is a 2…”, and hope that the network will learn the weights and biases it needs for each neuron in order to correctly classify new images. The key to doing this is establishing some criteria by which the network can assess its success. This is done in the form of a cost function, which is some numerical representation of the difference between the predicted values and the true ones. A commonly used cost function is cross entropy, given by:

Equation 3: C = −(1/n) Σₓ [ȳ ln y + (1 − ȳ) ln(1 − y)], the cross entropy cost function averaging over n training samples denoted by x. y is the predicted value and ȳ is the true value.

This function is higher the further away from the truth we are. As such, we want to adjust the weights and biases in such a way as to minimise it. If you've done A-level maths, you should be champing at the bit to use calculus to find minima and maxima. However, unless you're a Fields Medal winner with a supercomputer in your head, it's extremely unlikely you would be able to analytically minimise a function with over 12,000 parameters.

Instead, we use an approximation called gradient descent. This works by choosing a point in the function at random, calculating the gradient at that point and nudging to the left if the gradient is positive and to the right if it is negative. The size of the nudges is known as the learning rate and has implications for your network: too high and it will struggle to optimise, too low and it will take too long to learn. Performing this step repeatedly (until the gradient approaches zero) finds some local minimum of the function. At the end of the day, this is still calculus, so pay attention in school.
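To make the idea concrete, here is a minimal sketch of one-dimensional gradient descent on a simple polynomial; the function names are illustrative only:

```python
def gradient_descent(df, x0, learning_rate=0.01, steps=1000):
    # Repeatedly nudge x against the gradient:
    # left when df(x) is positive, right when it is negative
    x = x0
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x

# Minimise f(x) = (x - 3)^2; its gradient is 2(x - 3), so the minimum is at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With a sensible learning rate the nudges shrink as the gradient flattens, and x settles near 3; crank the learning rate too high and the steps overshoot the minimum instead.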

Figure 7: Illustration of gradient descent of a polynomial using two random starting values. Notice that they find different minima in the function.

The classic analogy here is of a ball rolling down a hill, but this quickly becomes abstract in our situation, which has many inputs and parameters, vastly increasing the dimensionality compared with the simple example shown above. However, the maths holds up for multiple variables, so we can implement this method to minimise the cost.

In practice, gradient descent (and other optimisation methods) is implemented using backpropagation. This works by first calculating the error in each of the neurons in the output layer, which is a function of the rate of change of the cost function with respect to each output activation. The reason it is called backpropagation is that we can use the errors in the neurons of the output layer to calculate the errors of each neuron in the previous layer, and so on until we reach the start. It is the reverse of what we did earlier when feeding the weights and activations forward through the layers. The maths gets a bit lengthy to explain here, but I've provided a few links in the acknowledgements section for the particularly brave. For the cowardly, treat backpropagation as a system whereby we estimate how much a particular weight (neural connection) affects the overall cost of the network, and then shift it accordingly. This is what occurs for a single training run of our network; we simply repeat that over and over with new training data until either it learns to solve the problem, or becomes stuck in a permanent state of familiar ignorance.

Enough Talk, Time For Action

So that’s the theory, which is all well and good, but it's time to put it into practice. As before, I'm using Python 3, with the TensorFlow library doing the heavy lifting for the neural network. It can be installed easily with pip or conda, and we don't need to worry about CPU or GPU optimisations for this example; just ignore the warnings when they inevitably pop up. As before, I'll supply code as gists, and will release the complete GitHub project in the final part. I encourage you to follow along; hopefully you've attempted the last two problems, but either way this part is once again independent, and I will provide training data later if you just want to focus on the network.

TensorFlow works by first declaring a graph that describes your network and then running sessions to perform training and prediction. We will implement the simple network shown in Figure 6, with one hidden layer between the input and output layers. Remember that our input images from the Sudoku grid are 28x28 pixels (cough MNIST cough), which gives us 784 neurons to play with, one for each pixel intensity in the image.
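A sketch of the input declarations, assuming the TensorFlow 1.x API this series was written against (available as tf.compat.v1 on newer installs):

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # this post predates TensorFlow 2's eager-execution default

# One input neuron per pixel of a 28x28 image, one output neuron per class.
# None lets us feed in however many images we like at once.
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
```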

One thing to notice here is the shape of these objects. Tensors (from which the library gets its name) are a bit like matrices (with important differences) in that they have a shape and dimensionality. Our placeholder for x, the input, has a shape of [None, 784]. The missing value will be filled in by the number of input images we choose, e.g. 81 for each cell of a Sudoku grid, each with 784 pixel values. The same is true for y_, but its second dimension is 10, as that is the number of possible classifications, 0–9. The shapes of these tensors are important when defining the layers, as we shall see.

Now we have our input layer, x, we can implement our hidden layer, which is calculated from our activation function and weights and biases from Equation 2.
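A sketch of that hidden layer, again assuming the 1.x API; the helper names weight_variable and bias_variable are my own, and the input placeholder is repeated so the snippet stands alone:

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Input layer, repeated so this snippet stands alone
x = tf.placeholder(tf.float32, shape=[None, 784])

def weight_variable(shape):
    # Small random values stop every neuron starting identically at 0
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1))

def bias_variable(shape):
    # A small positive bias reduces the chance of a neuron getting stuck
    return tf.Variable(tf.constant(0.1, shape=shape))

w_1 = weight_variable([784, 16])  # all connections from input to hidden layer
b_1 = bias_variable([16])         # one bias per hidden neuron
hidden = tf.sigmoid(tf.matmul(x, w_1) + b_1)  # Equation 2's weighted sum through sigmoid
```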

The weight and bias functions simply declare variables of our desired shape, giving them a small positive value. This prevents them from beginning at 0, which reduces the likelihood that a neuron will get stuck when learning. w_1 is a set of variables mapping the input layer (784 neurons) to the hidden layer (16 neurons); these are all of the connections between the two layers. Additionally, a bias is declared for each neuron. Lastly we implement Equation 2, using TensorFlow's implementation of sigmoid (Figure 5). tf.matmul simply means "matrix multiplication", so this step is very straightforward. Next, we declare the output layer, feeding forward the activations from the hidden layer.
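A sketch of the output layer under the same 1.x API assumptions, with the earlier layers repeated so it stands alone:

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Earlier layers repeated so this snippet stands alone
x = tf.placeholder(tf.float32, shape=[None, 784])
w_1 = tf.Variable(tf.truncated_normal([784, 16], stddev=0.1))
b_1 = tf.Variable(tf.constant(0.1, shape=[16]))
hidden = tf.sigmoid(tf.matmul(x, w_1) + b_1)

w_2 = tf.Variable(tf.truncated_normal([16, 10], stddev=0.1))  # hidden -> output weights
b_2 = tf.Variable(tf.constant(0.1, shape=[10]))               # one bias per output neuron
y = tf.matmul(hidden, w_2) + b_2  # raw scores; softmax is applied inside the cost function
```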

The output layer has 10 neurons, one for each possible outcome. w_2 declares all of the weights from the hidden layer to the output layer and b_2 is a bias for each neuron. We're not using sigmoid here, which might look a little strange. Normally we would want to use a variation of sigmoid called softmax, which is like a multivariate version of it: instead of producing a single value between 0 and 1, it produces a probability distribution across all the outcomes that sums to 1, that is, 10 numbers of which the highest is our prediction. We skip it on this line because TensorFlow has its own implementation as part of the cost function:
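One way the cost function, training step and accuracy checks could be written, sketched with the 1.x API (the network from the previous snippets is repeated so this one runs on its own):

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# The network from the previous snippets, repeated so this one stands alone
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
w_1 = tf.Variable(tf.truncated_normal([784, 16], stddev=0.1))
b_1 = tf.Variable(tf.constant(0.1, shape=[16]))
hidden = tf.sigmoid(tf.matmul(x, w_1) + b_1)
w_2 = tf.Variable(tf.truncated_normal([16, 10], stddev=0.1))
b_2 = tf.Variable(tf.constant(0.1, shape=[10]))
y = tf.matmul(hidden, w_2) + b_2

# Softmax and cross entropy in one call, averaged over all training items
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

# Gradient descent with a learning rate of 0.5, set to minimise the cost
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cost)

# Accuracy: does the most probable predicted digit match the expected one?
correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
```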

The cost function uses cross entropy combined with softmax, taking the predefined labels and the predicted values as input. This is all wrapped in a reduce_mean function, which takes the average across all training items. The training step declares a gradient descent optimiser with a learning rate of 0.5, set to minimise the cost function. Lastly, some functions are declared to check the accuracy of the system; these take the most probable predicted digit and compare it with the expected digit. Now, the moment of truth: running the application. No doubt you have run this already, only to find that nothing happened. That's because TensorFlow works declaratively; we need to initiate a session if we want to execute this network graph we've defined.

More importantly, we need some data to train it on. Have no fear, I have Blue Peter’d a dataset for you using Bernard’s image extraction algorithm. The dataset is split into a training set, with all the digits from 28 different Sudoku images, and a test set, which has only the digits from the original image at the top of this page. To use this, you will need to download this Dataset class and put it alongside your code in a folder called neural_net, along with a blank __init__.py file.

Figure 8: File structure for using the Dataset class.

I’ve made the dataset I used in this blog public so you can download it and use it along with this guide. The format is a pickle, which is essentially a file archive of a Python object. The dataset has a test set with all of the digits from the original photo shown at the top of this page, extracted and scaled using the algorithm described in the last post. It also has a training set of digits from 28 other photos of Sudoku grids, similarly processed, containing a total of 1135 digits. The Dataset class mainly does some housekeeping on the images so they can be used nicely with TensorFlow. The other thing it provides is a next_batch method, which returns a random set of unique images from the training set; we'll use this to speed up training steps, since giving the network all the data at once would be time-consuming.

Now the moment you’ve all been waiting for, we can run this code and hopefully watch Bernard learning in front of our eyes:
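A sketch of the training session, again with the 1.x API. Note that next_batch here is a random stand-in so the snippet runs on its own; substitute the Dataset class's real next_batch and test set to train on actual digits:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# The network from the previous snippets, repeated so this sketch stands alone
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
w_1 = tf.Variable(tf.truncated_normal([784, 16], stddev=0.1))
b_1 = tf.Variable(tf.constant(0.1, shape=[16]))
hidden = tf.sigmoid(tf.matmul(x, w_1) + b_1)
w_2 = tf.Variable(tf.truncated_normal([16, 10], stddev=0.1))
b_2 = tf.Variable(tf.constant(0.1, shape=[10]))
y = tf.matmul(hidden, w_2) + b_2
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cost)
accuracy = tf.reduce_mean(
    tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)), tf.float32))

# Random stand-in for the Dataset class's next_batch, so the sketch runs on its own
def next_batch(n):
    images = np.random.rand(n, 784).astype(np.float32)
    labels = np.eye(10, dtype=np.float32)[np.random.randint(0, 10, n)]
    return images, labels

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(500):
        batch_x, batch_y = next_batch(100)
        if step % 100 == 0:
            acc = sess.run(accuracy, feed_dict={x: batch_x, y_: batch_y})
            print('Step:', step)
            print('Training Accuracy:', acc)
        sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})
```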

Step: 0
Training Accuracy: 0.23
Step: 100
Training Accuracy: 0.96
Step: 200
Training Accuracy: 1.0
Step: 300
Training Accuracy: 0.99
Step: 400
Training Accuracy: 1.0

Test accuracy: 1.0

Success! It only takes a few seconds, as we have a very simple network and only a small amount of training data, but we can see how rapid the progress is. Initially the training accuracy is poor, as we would expect given that the variables are all randomly initialised. However, even after 100 steps the training accuracy shoots up to 96%. At the end of the training we can see that the network is able to correctly classify the digits from the original image. Since we saved the model, we can restructure the network code so that it only classifies new images, without training the model further. Tie that in with the solver and grid extractor and we have a complete solution:
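Saving and restoring can be done with TensorFlow's Saver; a minimal sketch under the same 1.x API assumptions, where the checkpoint path and stand-in variable are purely illustrative:

```python
import os
import tempfile
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

v = tf.Variable(tf.zeros([16]))  # stand-in for the network's weights and biases
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Save the trained variables to a checkpoint...
    path = saver.save(sess, os.path.join(tempfile.mkdtemp(), 'digits'))
    # ...and in a later run, restore them instead of retraining
    saver.restore(sess, path)
```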

Figure 9: Completed Sudoku board, parsed from an image. Overlaying the numbers over the board was done by performing the inverse of the warpPerspective function we did when we extracted the grid.

So What?

So he’s finally done it. Bernard has managed to solve a Sudoku grid from the image alone, but there should be a few questions popping up in your mind. For starters, we only tested Bernard on one image and trained a whole network to solve only this specific problem. If that sounds to you a bit like using Mjölnir to crush a walnut, then you’re absolutely correct. The real test is to see how good Bernard is in general. Over the course of this project I have endured derogatory stares on public transport to build up a test set of 128 photographs of Sudoku boards. I gave them all to Bernard to have a go at solving, and he could only get a little over half right: 58%. This is not because the puzzles were too hard to solve, far from it, but because he was unable to correctly identify the board. Perhaps the angle was too steep, the lighting was bad, the photo was blurry, the font was one he didn’t understand, or he simply failed for no reason obvious to a human. This is pretty poor, and you might be thinking that this neural network malarkey isn’t worth the paper all those theses were printed on. However, I disagree: this guide shows an excruciatingly basic network that is based on quite old technology. The field of neural networks has come on leaps and bounds since then, but it is difficult to explain the more complex networks without first understanding the basics. Join me next time for the final part of this series, where we use some of these techniques to vastly improve Bernard’s accuracy.

Acknowledgements

3Blue1Brown, and especially his video series on neural networks. At the time of writing, only the first two in the series are out, but they are absolutely incredible resources for learning this topic. In fact, pretty much everything he has done is fantastic and I urge you to take some time to watch some of his videos; you won’t find better explained mathematics anywhere on the net.

Michael Nielsen’s online book on neural networks. This is a fantastic, free resource that explains in detail the mechanics of neural networks and the maths behind them. I highly recommend it to anyone trying to learn the subject.

Andrew Ng’s lectures, a really well explained video series on neural networks.

TensorFlow, without which I would have been forced into making a ham-fisted effort at hand-cranking the network. In particular, their MNIST for Beginners and Deep MNIST for Experts tutorials are excellent and wholly relevant to this exercise.
