“Machines can’t think” is a belief held by most people. That’s because our machines are best at crunching numbers, performing very precise calculations and generally doing tasks that have well-defined inputs and outcomes. Well, things have changed!

Of course, machines can’t think like humans do; they think differently. This post is the first in a series I will be writing on neural networks, and I will demonstrate precisely how to build machines that think.


Why couldn’t machines think before? The answer goes back a very long time, as far back as the days of Charles Babbage. Computers were designed to calculate numbers; at inception they were primarily calculating machines. Today, these calculations are combined in very sophisticated ways to enable amazing features like the Internet, graphical user interfaces, communication, video and audio recording and so much more that we all take for granted. But all these things still rely on “absolute certainties”. By this I mean that, to implement them, we only need to write sophisticated algorithms that follow very specific steps. It turns out that all of these features can be built using manually written algorithms. But thinking? Thinking doesn’t rely solely on absolute certainties like 2 + 2 = 4.

Hence, things like speech recognition, generation and understanding, image classification and object detection, text understanding, generation and sentiment classification, time series analysis and a host of other things that humans do effortlessly are incredibly hard, and often impossible, for computers to do with hand-written rules. In fact, the seemingly trivial task of telling a cat from a dog was practically impossible for a computer to do well, at least until a few years ago. The reason is that thinking involves weighing probabilities based on different observations, and since classical programs do not reason with probabilities, they couldn’t think.

In light of this realization, great computer scientists applied probability theory and created statistical learning, paving the way for algorithms based on probabilities. Their work eventually evolved and led to the rise of neural networks, enabling computers to really think and solve problems that were once thought impossible. As I said earlier, computers can’t think like humans; they think differently. So when you hear “neural networks”, do not assume we have succeeded in replicating the human brain. It’s all still mathematical functions.

Without further ado, I will honour the title of this post by introducing you to neural networks and how they enable computers to think and solve problems.


Neural networks, like all other fields of statistical learning, perform tasks through the following steps.

  1. Design a good learning algorithm (I will explain this soon).
  2. Feed the algorithm a well-represented dataset to learn from.
  3. Use the model the algorithm generated from the data to predict outcomes for newly fed data.

To illustrate, here is how we would build an image recognition system for cats and dogs.

  1. We design a good learning algorithm that can properly extract the features of cats and dogs.
  2. We collect a large sample of images of different kinds of cats and dogs, and our algorithm builds a model of cats and dogs from this data.
  3. We feed in any new image, and the model tells us the probability of it being an image of a cat or a dog.

That’s simple enough.

Neural networks are made up of artificial neurons, similar “in concept” to neurons in the human brain. These neurons are connected to each other, forming a lot of connections, and the system works through the activation of these neurons.


  1. Neural networks are made up of many connected neurons.
  2. Each neuron activates in the presence of certain observations.
  3. The algorithm learns from the data which neurons to activate in order to predict a certain class of features.
  4. The learned pattern of activations is called the model.

Using our cat and dog example, let’s say we have three neurons in our neural network.

Neurons A, B, C

After training on the data and assigning neurons A, B and C to important features that make up cats and dogs, the algorithm might learn that when neurons A and C are activated, the image belongs to cats, but if A and B are activated, the image belongs to dogs.
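The A, B, C example can be written down as a tiny lookup. This is only an illustration of the idea: the function name and the "unknown" fallback are my own assumptions, and a real network learns graded activations rather than hard-coded rules.

```python
def classify(active_neurons):
    """Map the set of active neuron names to a label, as in the A/B/C example."""
    if {"A", "C"} <= active_neurons:   # A and C both active
        return "cat"
    if {"A", "B"} <= active_neurons:   # A and B both active
        return "dog"
    return "unknown"                   # hypothetical fallback, not from the example


print(classify({"A", "C"}))  # cat
print(classify({"A", "B"}))  # dog
```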

This should give you a feel for how neural networks work. A real system may have thousands, millions or even billions of such neurons, which lets it represent more features and hence build better models that make better predictions. For comparison, the human brain has roughly 86 billion neurons.


[Figure: a toy neural network. Image credit: Michael Nielsen]

Above is a toy neural network.

The first three small circles are the inputs from the data. In image recognition these would be the pixels of the image, with each pixel represented by a small circle. As seen above, the input layer is fed into a layer of four neurons. These four neurons learn the features of the input and construct an activation map based on what they have learnt. Finally, the activations of these four neurons are fed into a single output neuron that performs the actual prediction based on the activations of the four neurons. This output can be the probability of our input being the picture of a cat.

The type of network described above is called a “fully connected” network, because every neuron in one layer is connected to every neuron in the next layer. Other types of networks exist for different purposes, but for now this discussion is limited to fully connected networks.
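To make the toy network concrete, here is a minimal sketch of one forward pass through three inputs, four hidden neurons and one output. All the numbers are hand-picked, hypothetical parameters (a trained network would learn them). The choice of max(0, z) for the hidden layer and a sigmoid for the output is also my assumption, made so that the output reads as a probability.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every neuron sees every input value."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical parameters: 4 hidden neurons with 3 weights each, 1 output neuron.
hidden_w = [[0.2, -0.5, 0.1], [0.7, 0.3, -0.2], [-0.4, 0.6, 0.9], [0.1, 0.1, 0.1]]
hidden_b = [0.0, 0.1, -0.1, 0.0]
output_w = [[0.5, -0.3, 0.8, 0.2]]
output_b = [0.05]

pixels = [0.9, 0.1, 0.4]                              # three input values
hidden = dense(pixels, hidden_w, hidden_b, relu)      # four activations
output = dense(hidden, output_w, output_b, sigmoid)   # one value between 0 and 1
print(hidden, output[0])
```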


So far, I have mentioned neurons learning the features of inputs and being connected to each other, but a number of questions remain open.

What is a neuron?

How do Neural Networks learn the features of a dataset?

A neuron is a mathematical function that receives inputs and gives a single output representing the result of a computation on those inputs. The exact form of the function differs, but the first artificial neuron was called the perceptron. The perceptron receives binary inputs and computes the linear function Wx + b, where x is the binary input and W and b are the parameters (weights and bias) of the neuron. These parameters are inferred from training examples via machine learning.

The perceptron outputs 1 if Wx + b > 0 and it outputs 0 otherwise.
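Here is a minimal sketch of that rule in plain Python. The weights and bias are hypothetical values I set by hand so that the neuron behaves like a logical AND; in practice they would be learned from data.

```python
def perceptron(x, W, b):
    """Return 1 if Wx + b > 0, otherwise 0."""
    z = sum(w_i * x_i for w_i, x_i in zip(W, x)) + b
    return 1 if z > 0 else 0


# Hypothetical hand-set parameters (not learned): behaves like logical AND.
W, b = [1.0, 1.0], -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, W, b))  # only (1, 1) gives 1
```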

However, the linear nature of the perceptron makes it unsuitable for most real-world problems, whose underlying patterns are usually non-linear. Aside from this, perceptrons deal only with binary inputs and outputs, so they are incapable of learning the true nature of real training data.

Better activation functions exist, such as sigmoid, tanh, ReLU and its variants, and Maxout. Of these, ReLU is the most battle tested and the usual default choice.

It’s also the simplest function.

It takes the form max(0, z)

where z = Wx + b

It simply returns 0 when z is less than 0, and returns z itself when z is 0 or higher.
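In code, this is a one-liner. The sketch below also wraps it into a single neuron; the helper name relu_neuron and the sample numbers are just illustrative assumptions.

```python
def relu(z):
    """max(0, z): zero for negative z, z itself otherwise."""
    return max(0.0, z)


def relu_neuron(x, W, b):
    """A single neuron: compute z = Wx + b, then apply ReLU."""
    z = sum(w_i * x_i for w_i, x_i in zip(W, x)) + b
    return relu(z)


print(relu(-2.0))                                  # 0.0
print(relu(3.5))                                   # 3.5
print(relu_neuron([1.0, 2.0], [0.5, -1.0], 0.25))  # z = -1.25, so 0.0
print(relu_neuron([1.0, 2.0], [0.5, 1.0], 0.25))   # z = 2.75, so 2.75
```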


Deep neural networks are made up of thousands and sometimes millions of such neurons, networked layer by layer.

This setup has made speech recognition and understanding, language translation, object detection and classification, and advanced analytics possible.

[Figure: a basic neural network for classifying handwritten digits. Image credit: Michael Nielsen]

Depicted above is a basic neural network to classify Handwritten digits.


Deep neural networks work so well because they decompose problems into sub-problems of sub-problems of sub-problems. This allows them to learn very accurate representations that are invariant to variations in the way an input is presented. The first layers learn basic concepts of the input, the next layers learn from the representations of the first layers, and so on. Hence, deep neural networks model each layer as a function of the previous layer.

L_n = f(L_{n-1})
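As a rough sketch of this layer-on-layer idea, the snippet below chains a list of functions so that each one receives the output of the previous one. The three "layers" are hypothetical stand-ins chosen to keep the arithmetic easy to follow; real layers are the weighted sums and activations described earlier.

```python
from functools import reduce


def run_layers(layers, x):
    """Feed x through the layers in order: each output becomes the next input."""
    return reduce(lambda previous, layer: layer(previous), layers, x)


# Hypothetical stand-in layers, not real trained layers.
layers = [
    lambda v: [2 * a for a in v],      # scale
    lambda v: [a + 1 for a in v],      # shift
    lambda v: [max(0, a) for a in v],  # ReLU
]

print(run_layers(layers, [-3, 0, 2]))  # [0, 1, 5]
```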

This allows deep neural networks to discover the features of images and speech signals autonomously, with minimal feature engineering.

Hopefully, this gives you good insight into what neural networks are.

Practice is the best way to learn anything, so for the rest of this tutorial we will build a very effective image recognition program. I will also explain all the other components required to build and train neural networks as we set up the image recognition system.

I am assuming you are familiar with Python. If not, you might head over to a good Python tutorial and come back when you are done with the basics.

Ensure you have the following installed:

  1. A good Python IDE; I recommend PyCharm
  2. Python 3.5 or above

We will need some additional Python packages.


Google’s TensorFlow is one of the most widely used deep learning libraries, and it is incredibly robust.

Install TensorFlow with pip (the package is named tensorflow). A GPU-enabled build is also available if you have an NVIDIA GPU installed on your system.

Neural networks run much faster on GPUs. For more details, visit the TensorFlow website.

Once done, you need to install another awesome library called Keras.

Keras is an API on top of deep learning libraries. Writing pure TensorFlow code is a little harder; Keras is simpler, and it is also a common abstraction.

Install Keras with pip as well (the package is named keras). For more details, head to the Keras website.

Also install h5py; we shall need it for saving generated models.

Now we are set up.

Open your IDE and create a new Python file. Give it whatever name you wish.


Here is what we are trying to do. MNIST is a dataset of 70 000 images of handwritten digits (0–9), each 28 x 28 pixels.

We want to train a neural network on 60 000 images, and then use our trained model to predict the correct class of the remaining 10 000 images.

This dataset was compiled by the great legend of computer vision, Yann LeCun, together with his collaborators. More information is available from the official MNIST page.

First, import the modules we need.

Next, load the dataset.

Ensure your computer is connected to the internet, then run the code.

Keras will automatically download the dataset (about 15 MB in size).

train_x and test_x refer to the training and test images, while train_y and test_y refer to their labels.
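The original listing for this step is not reproduced here, so the following is only a sketch of how the loading step might look. It assumes the Keras API bundled with TensorFlow (tensorflow.keras); the wrapper function load_mnist is my own naming, and the import sits inside it so the file can be loaded even where TensorFlow is not installed.

```python
def load_mnist():
    """Download MNIST on first use and return the train and test splits."""
    # Assumption: Keras as shipped with TensorFlow. Imported lazily on purpose.
    from tensorflow.keras.datasets import mnist

    (train_x, train_y), (test_x, test_y) = mnist.load_data()
    return (train_x, train_y), (test_x, test_y)


# Usage (needs TensorFlow and an internet connection on the first call):
# (train_x, train_y), (test_x, test_y) = load_mnist()
# print(train_x.shape, train_y.shape)
# print(test_x.shape, test_y.shape)
```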

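If you print the shapes of these arrays, you should see 60 000 training images and 10 000 test images of 28 x 28 pixels each.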

Next, flatten each image into a vector of 784 pixels (28 * 28).

Note that this loses the 2D structure of the image. For now we shall pay that price because we are developing a simple model; in later tutorials we shall avoid doing this. (Bear with me.)
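To see what flattening does, here is a small pure-Python sketch on a hypothetical blank image with one bright pixel. With the NumPy arrays that Keras returns, you would more likely use the array’s reshape method, but the idea is the same: the rows are laid end to end.

```python
def flatten(image):
    """Turn a 2D grid of pixels (a list of rows) into one flat list, row by row."""
    return [pixel for row in image for pixel in row]


image = [[0] * 28 for _ in range(28)]  # hypothetical blank 28 x 28 image
image[3][5] = 255                      # one bright pixel at row 3, column 5

flat = flatten(image)
print(len(flat))          # 784
print(flat[3 * 28 + 5])   # 255: row 3, column 5 ends up at index 89
```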

Next, the network cannot use the raw labels (the digits 0–9) directly as targets, so we have to convert all our labels to one hot encoded vectors. I will explain one hot encoding in more detail in the next tutorial.
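As a quick preview, a one hot vector is all zeros except for a single 1 at the position of the label. The helper below is a hand-rolled sketch for illustration; Keras ships its own utility for this (to_categorical in keras.utils), which is what you would normally call.

```python
def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1.0 at index `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


print(one_hot(3))  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```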

Now our data is fully ready for training. The next thing is to define our neural network.
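The original model listing is not reproduced here, so treat the following as a sketch that matches the description in this post rather than the exact code: a Sequential model of Dense layers with 128 ReLU units each, an input of 784 values and a final 10-unit softmax layer. The number of hidden layers (three, giving four layers in total) is my assumption, as is the use of tensorflow.keras. The import is placed inside the function so the file loads without TensorFlow installed.

```python
def build_model():
    """Build a small fully connected classifier for flattened 28 x 28 images."""
    # Assumption: Keras as shipped with TensorFlow. Imported lazily on purpose.
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.models import Sequential

    model = Sequential()
    # First layer: tell Keras that each input is a vector of 784 pixels.
    model.add(Dense(units=128, activation="relu", input_shape=(784,)))
    model.add(Dense(units=128, activation="relu"))
    model.add(Dense(units=128, activation="relu"))
    # Output layer: one unit per digit class, softmax turns scores into probabilities.
    model.add(Dense(units=10, activation="softmax"))
    return model


# Usage (needs TensorFlow installed):
# model = build_model()
# model.summary()
```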

The model definition is very straightforward. First, our model is a sequential list of layers of neurons.

Once the model is defined as an instance of the Sequential class, we add layers to it using the add() function.

Each layer is called a Dense layer because every neuron in it is connected to every neuron in the layer before it.


The units argument is the number of neurons in a single layer; in this model there are 128 neurons in each layer of the network. We can set this to any number of neurons, but beware: the more units, the more computationally expensive your model is. As a practical rule of thumb for a model like this, do not set your units beyond 512.


The line input_shape=(784,) might seem confusing, but it’s very simple. Our images have 28 x 28 = 784 pixels, so we have to specify that input size in the first layer. Keras infers the shapes of the following layers.


I already explained activations. For each layer, we have to specify the activation function to use. ReLU is the most commonly used.


The final layer can seem confusing too. First, notice that we specify just 10 units. The reason is that, in our output, we want a vector (a 1 dimensional array) of the probabilities of each image belonging to each of the 10 classes of digits (0–9), so we specify 10 units in our final layer.

Also, our activation is now softmax. I didn’t mention softmax while talking about activations earlier on because it is a special type of activation, the exact form of which I will explain in the next post in this series. It is normally used only in the final layer of a classification network, unlike ReLU, which belongs in the layers before it.

Softmax is an activation function that takes a set of scores for N classes, 10 classes in this case, and transforms these scores into probabilities in such a way that the probabilities sum to exactly 1. I will fully explain how this transformation occurs in the next post.
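As a small preview, here is a sketch of the standard softmax formula in plain Python: exponentiate each score, then divide by the total. The ten scores are made-up numbers, and subtracting the largest score first is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the 10 digit classes; class 7 has the highest score.
scores = [1.0, 0.5, -1.2, 0.0, 2.0, 0.3, -0.7, 4.0, 1.5, 0.1]
probs = softmax(scores)
print(probs.index(max(probs)))  # 7
print(sum(probs))               # approximately 1.0
```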

Up next, we need to specify a few components that our network needs in order to train on the data.

Here we ask Keras to compile our model with the right components. We shall now go through the components one by one.


When neural networks train, they update the parameters (weights and biases) step by step, at a rate defined by a parameter called the learning rate. The optimizer defines exactly how the parameters are updated.

The most common optimizer is called Stochastic Gradient Descent (SGD).

Stochastic means random, gradient means slope and descent means to go lower.

SGD updates the parameters in this way: the partial derivatives of a defined loss function with respect to the parameters are computed using a technique called backpropagation, then we multiply this gradient by the learning rate and subtract it from the current parameter values.
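The update rule itself is tiny. The sketch below applies it to a toy problem of my own choosing: a single parameter w with loss (w - 3)^2, whose gradient 2(w - 3) is written by hand instead of being computed by backpropagation. It also uses the exact gradient and a learning rate of 0.1 so the toy converges quickly; real SGD estimates the gradient from a random batch of training examples.

```python
def sgd_step(params, grads, learning_rate):
    """One update: subtract learning_rate * gradient from each parameter."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Toy problem (an assumption for illustration): minimise (w - 3)^2.
w = [0.0]
for _ in range(100):
    grad = [2 * (w[0] - 3.0)]                 # derivative of the loss w.r.t. w
    w = sgd_step(w, grad, learning_rate=0.1)

print(w[0])  # very close to 3.0, the minimum of the toy loss
```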

If this confuses you, never mind; I will explain more in later posts. Just note the 0.001 we pass in as the learning rate.


I already mentioned that SGD uses a loss function to compute derivatives. The loss function we use here is called categorical cross entropy; I will explain this later too.
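For the curious, here is a short sketch of the usual formula for one example: the negative log of the probability the model gave to the correct class. The three-class example and the small eps guard against log(0) are my own illustrative choices.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one example: y_true is one hot, y_pred is a list of probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0.0, 1.0, 0.0]                  # the correct class is index 1
confident = [0.1, 0.8, 0.1]               # model mostly agrees
wrong = [0.7, 0.2, 0.1]                   # model prefers the wrong class

print(categorical_cross_entropy(y_true, confident))  # about 0.22
print(categorical_cross_entropy(y_true, wrong))      # about 1.61, a larger loss
```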

Finally, we specify the metrics. This is no math stuff; we are simply saying we want the model to report its accuracy back to us. No biggie.

That explains the compile function.
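Putting the three components together, the compile step described above might look like the sketch below. The original listing is not reproduced here, so this is an assumption based on the text: SGD with a learning rate of 0.001, categorical cross entropy as the loss, and accuracy as the metric. The wrapper name compile_model is mine, and the keyword learning_rate is the spelling used by current Keras releases.

```python
def compile_model(model, learning_rate=0.001):
    """Attach optimizer, loss and metrics to a Keras model and return it."""
    # Assumption: Keras as shipped with TensorFlow. Imported lazily on purpose.
    from tensorflow.keras.optimizers import SGD

    model.compile(
        optimizer=SGD(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Usage (needs TensorFlow and an uncompiled Keras model):
# model = compile_model(model)
```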

Next, we add the call that trains the model.

This is very simple. We feed in our training images (train_x) and their labels (train_y), and we also specify a batch size. This is very important: to avoid processing all our training data at once, we specify how many images to load at a time. As a rule of thumb, use a batch size of at most 200.
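Here is a sketch of both ideas. The batches helper only illustrates what a batch size means, and the train wrapper shows how the fit call on a compiled Keras model might look. The default batch size of 32 and the 10 epochs are hypothetical values of mine, since the original listing is not reproduced here.

```python
def batches(items, batch_size):
    """Yield consecutive chunks of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def train(model, train_x, train_y, batch_size=32, epochs=10):
    """Call fit() on an already compiled Keras model; Keras does the batching."""
    return model.fit(train_x, train_y, batch_size=batch_size, epochs=epochs)


# 10 hypothetical samples in batches of 4 give chunks of size 4, 4 and 2.
print([len(b) for b in batches(list(range(10)), 4)])  # [4, 4, 2]
```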


This prints the accuracy of our model on the test data, which tells us how well it performs on images it has not seen.

We run a single evaluation of the accuracy.
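The evaluation step might be sketched as follows. It assumes a Keras model compiled with accuracy as its only metric, in which case evaluate() returns the loss followed by the accuracy. The wrapper name report_accuracy and the batch size are my own choices.

```python
def report_accuracy(model, test_x, test_y, batch_size=32):
    """Evaluate a compiled Keras model once on the test set and print its accuracy."""
    # With metrics=["accuracy"], evaluate() returns [loss, accuracy].
    loss, accuracy = model.evaluate(test_x, test_y, batch_size=batch_size)
    print("Final accuracy is:", accuracy)
    return accuracy


# Usage (needs a trained Keras model and the prepared test data):
# report_accuracy(model, test_x, test_y)
```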

Run the script and watch the numbers.

This model takes about 5 minutes to train on a good CPU, or much less time on a GPU.

Final accuracy is: 0.95750000000000002

Check the last log in your console to compare your result.


Yes, machines can think. With a four layer deep neural network, machines can think deeply enough to tell which digit a piece of handwriting represents with 95.7 % accuracy. Machines are not conscious, but they can think through the activations of artificial neurons.

Any questions and comments are welcome. Post them below and I will respond as appropriate.

The full code for this tutorial can be found on GitHub

If you enjoyed this tutorial, please give some CLAPS. Thanks.

Reach me on Twitter @johnolafenwa


Software Engineer at Microsoft | Creator of TorchFusion (
