What is a Neural Network, anyway? Day 2: #100DaysOfML
Hi again! Thanks for joining me on our continuing adventure toward Machine Learning Mastery.

Today, as promised in yesterday’s post, we’ll be discussing the first chapter of Michael Nielsen’s excellent (and still free!) book: Neural Networks and Deep Learning. At a high level, we’ll talk about what a neuron is, how a neural network is structured, and the intuition behind how a neural network learns.

Just before diving in, however, I’d like to give a shout-out to Grant Sanderson, who may be more familiar to you as 3blue1brown. I can’t recommend his videos enough: they’re clear and approachable. His video entitled “But what *is* a Neural Network?” is embedded below. Check it out if you’d like another vantage point on this fascinating topic:
Using neural nets to recognize handwritten digits
Michael begins by asserting — I think correctly — that the human visual system is, “one of the wonders of the world.” When we see, we’re automatically and subconsciously making sense of the sensory data streaming into our brains.
Michael highlights the effortlessness with which we interpret the world and entreats us to appreciate the sophistication of our visual cortex: writing a program by hand (har har) to recognize handwritten digits turns out to be algorithmically intractable.
Enter a biologically inspired computational paradigm that solves our problem via a different approach: the neural network. At the highest level, the idea is to build a system that can learn from a large number of training examples. Rather than craft an assemblage of explicit rules instructing the computer on how to handle individual cases, we construct a system that learns what it is we’re showing it, after having digested a sufficient number of training cases.
As an aside: (and as you may know) it turns out that the handcrafted approach toward AI also exists and has proponents. While I won’t be covering this in any depth, I thought you may be interested in briefly exploring a different paradigm: that of so-called “expert systems” a la Cyc.
Perceptrons
To briefly summarize: A perceptron is a function (or neuron) that takes as input some number of binary inputs and outputs a binary output. Here is a general picture of a perceptron:

Each of the inputs is weighted by some tunable real number (so as to scale the relative “importance” or magnitude of that input). The sum of all of the weighted inputs is then compared against the neuron’s threshold value (another tunable real number that is a property of the neuron), and if the sum exceeds the threshold, the perceptron fires, meaning it outputs a binary value of 1. If the sum is less than or equal to the threshold, the perceptron fails to fire and outputs a zero.
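The rule above is simple enough to sketch directly. Here’s a minimal, illustrative perceptron in Python (the function name and example weights are my own, not from the book):

```python
def perceptron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Two binary inputs, weighted 0.6 and 0.4, against a threshold of 0.5:
print(perceptron([1, 0], [0.6, 0.4], 0.5))  # 0.6 > 0.5, so it fires: 1
print(perceptron([0, 1], [0.6, 0.4], 0.5))  # 0.4 <= 0.5, so it stays quiet: 0
```

Tuning those weights and the threshold is exactly the knob-turning that learning will later automate.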
Michael makes the point that one could construct layers of these perceptrons and, through this added complexity, build ever more complex and abstract systems.
Michael makes a notational simplification in moving away from the threshold of a neuron, as discussed above, instead defining and using the bias of a neuron. The bias, b, is defined as the negative of the threshold.
The rule for firing is now, rewritten with the bias:
Output =1 if the sum of the weighted inputs plus the bias is greater than zero.
Output = 0 if the sum of the weighted inputs plus the bias is less than or equal to zero.
Since the bias is the negative of the threshold, a large positive bias makes the neuron easy to fire, while a large negative bias (i.e., a high threshold) is like a high hurdle: harder to clear.
Michael then provides an explanation, and a construction out of perceptrons, of the functioning of the logical NAND gate, which is a universal logic gate (meaning that combinations of NAND gates can produce the behavior of any other combination of logic gates). These logic gates are the building blocks of digital circuits, and so, in principle, we could construct a computer out of perceptrons (side note: has anyone done this? I poked around Google a bit but found nothing. Seems like quite the hobbyist’s endeavor!).
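Michael’s NAND construction is worth seeing concretely. Using the bias formulation from above, a perceptron with weights of -2 on each input and a bias of 3 reproduces the NAND truth table:

```python
def perceptron(inputs, weights, bias):
    """Fire if the weighted sum of the inputs plus the bias is positive."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def nand(x1, x2):
    # Michael's example: weights of -2 and -2, bias of 3.
    return perceptron([x1, x2], [-2, -2], 3)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", nand(a, b))  # 1, 1, 1, 0: the NAND truth table
```

Only when both inputs are 1 does the weighted sum (-4) plus the bias (3) dip below zero, so only then does the output drop to 0.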
While intriguing, this isn’t (at least immediately) learning to recognize handwritten digits, and so Michael pivots somewhat by introducing the concept of learning algorithms and sigmoid neurons, to be covered immediately below.
Before moving on, I really wanted to bring Daniel Shiffman’s The Nature of Code to your attention. Like Michael’s book, The Nature of Code is free and has an extensive and excellent treatment of perceptrons including lots of code with use cases. Also, his youtube channel The Coding Train is great :)
Sigmoid neurons
Okay, so… why switch neurons? The motivation for using sigmoid neurons has to do with the binary behavior of the perceptrons. The idea, which Michael discusses, is: small changes in the weights and biases of the neurons in our network ought to cause small changes in the output of our network.
That’s the general idea behind the learning algorithms we’ll review. The basic recipe is: initialize a network, feed it some data, inspect its output, and slightly nudge it in the right direction. Because of the binary operation of the perceptron, this nudging isn’t possible with perceptrons. To illustrate: imagine sitting somewhere on the output of Figure 2, above, on the output-equals-zero part of the step function. Nudged far enough to the right, you suddenly step up to one; nudged left again, you drop back to zero. There’s no in-between and, hence, no real notion of gradual change. Hence the new flavor: the sigmoid.
A sigmoid neuron is conceptually similar to a perceptron. The main difference is that both its input(s) and its output can take on any real value between 0 and 1 (rather than just the binary behavior of the perceptron). Here’s a graph of the sigmoid function, which is the output of a sigmoid neuron (to be somewhat careful here: the output of the neuron will only ever be a point that exists on the characterizing sigmoid function of the neuron).

The sigmoid function above is the output of the sigmoid neuron and ranges smoothly over all possible input values (for clarity: the neuron would accept as input, in this case, a value along the x-axis and dispense as output a value along the y-axis). As is clear from the above picture, the action really happens in the input range from about negative four to about four. Interestingly, then, the perceptron and sigmoid aren’t really that different for extremely positive or negative input values: it’s only in this “middle range” where this smoothness comes into play. As Michael says, “[the shape of the sigmoid function] is a smoothed out version of a step function,” which is what the perceptron outputs (as pictured in Figure 2, above).
Michael’s treatment of this is, of course, nuanced and well presented. He discusses in a mathematically rigorous fashion how, when using sigmoid neurons, small changes in the output are approximately a linear function of small changes made to the weights and biases of the neural network. For tractability and brevity I’ll omit that here while referring the motivated reader to Michael’s treatment. Intuitively, the smooth shape of the sigmoid is key, as it permits us to slightly nudge the output of the network in the right direction over time and over many training examples.
The architecture of neural networks
This section exists as a means of clarifying and sorting out the terminology we’ll use. A depiction of a small neural network is below:

The pieces of the network are as above: the leftmost group or column of neurons is referred to as the input layer. For example, the input neurons of a neural network could receive as input the greyscale value of a particular pixel, as discussed in 3blue1brown’s video on the topic.
The output layer is the rightmost column or group of neurons (in the above, of course, the output layer has only a single neuron).
The hidden layers are perhaps poorly named. The “hidden” in hidden layers, as Michael discusses, simply means, “neither an input nor an output layer.” A neural network can have any number of hidden layers.
And that’s it for our terminological bookkeeping! Two notes, though, before moving onto the next section, and those are:
A structural note: we’ve so far considered only so-called feedforward neural networks. These are as depicted above in figure 4: One layer passes its output to the next layer, which passes its output to the next layer, and so on until we receive an output. There are, as you may well know, other types of neural networks (e.g., recurrent neural networks), which we’ll discuss in future posts.
Also — and as we’ll soon see — the performance and behavior of neural networks is, perhaps unsurprisingly, dependent upon not only the parameters of the neural network (i.e., the weights and biases of each neuron within the network), but also upon the so-called hyperparameters, which are configurable parameters set by the machine learning enthusiast prior to running the network. So, for example, things like: the number of hidden layers, the number of neurons (total and within each layer), the learning rate. We’ll cover these topics in great detail in our learning journey.
A simple network to classify handwritten digits

In this section Michael proposes using a three layer neural network to classify individual handwritten digits. The overall structure he proposes is quite simple, consisting of a single input layer, a single hidden layer, and a single output layer.
As illustrated in 3blue1brown’s excellent video on this topic, the input layer will consist of 784 neurons, each of which receives as input a single pixel of the 28 x 28 pixel image that will be fed forward into the network. The input image is greyscale, and so the values of the pixels range from 0.0, denoting a totally white pixel, to 1.0, denoting a completely black pixel.
The single hidden layer of this network will initially have 15 neurons. We’ll adjust this parameter a few times and measure the resulting output (that is, investigate how sensitive our performance is to this hyperparameter).
The output layer will consist of 10 neurons — one for each digit. As Michael discusses, the highest output value of the network will be considered its best guess regarding the correct label to apply to the input image. So, for example, if the neuron corresponding to the label “5” fires most brightly (i.e., its output is closest to 1 of all of the output neurons), we’ll say that the network has determined that the input image must be a “5.”
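To make the 784 → 15 → 10 architecture concrete, here’s a hedged sketch of a forward pass through such a network with randomly initialized weights and biases (the function names are my own; a real implementation, like the one in Michael’s book, would use NumPy matrices):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(x, layers):
    """Pass an input vector through each (weights, biases) layer in turn."""
    for weights, biases in layers:
        x = [sigmoid(sum(w * a for w, a in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

# Randomly initialized network with the layer sizes from the text: 784 -> 15 -> 10.
random.seed(0)
sizes = [784, 15, 10]
layers = [([[random.gauss(0, 1) for _ in range(m)] for _ in range(n)],
           [random.gauss(0, 1) for _ in range(n)])
          for m, n in zip(sizes[:-1], sizes[1:])]

image = [0.0] * 784                # a blank 28 x 28 "image", flattened
output = feedforward(image, layers)
guess = output.index(max(output))  # the brightest output neuron is the guess
print(len(output), guess)
```

Untrained, the guess is of course meaningless; learning is the process of adjusting those weights and biases so the brightest output neuron matches the true label.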
Learning with gradient descent

We’ve sketched the broad strokes of the computational engine, as it were, and so now we turn to consider the computational fuel. As discussed above, our network will learn from looking at lots of data and, when wrong, being nudged in the right direction. The data we’ll be using is from the MNIST Database. Since we’ll train our network on this data, it’s sensible to call it our training data. Michael discusses the origin and characteristics of the MNIST data set in detail, but suffice it to say that our data set contains two parts: the training set, containing 60,000 labelled images, and the test set, containing 10,000 labelled images.
Quoting directly from Michael’s book:
We’ll use the notation x to denote a training input. It’ll be convenient to regard each training input x as a 28×28=784-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We’ll denote the corresponding desired output by y=y(x), where y is a 10-dimensional vector. For example, if a particular training image, x, depicts a 6, then y(x)=(0,0,0,0,0,0,1,0,0,0)T is the desired output from the network. Note that T here is the transpose operation, turning a row vector into an ordinary (column) vector.
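The desired output y(x) in the quote above is what’s often called a one-hot vector, and it’s trivial to construct (the helper name is my own):

```python
def one_hot(digit):
    """Build the desired 10-dimensional output y(x) for a labelled digit."""
    y = [0] * 10
    y[digit] = 1
    return y

print(one_hot(6))  # [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```

Training then amounts to nudging the network so that its 10-neuron output gets ever closer to this vector for each labelled image.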
So far, so good. It is at this juncture, however, that Michael launches into excellent exposition on gradient descent. Not surprisingly, really, given the title of the subheading. :)
Gradient descent warrants its own article, so let’s pick up tomorrow with a treatment of what gradient descent is, how it works, and how it helps our neural networks to learn.
Thanks as always for choosing to spend a bit of your day with me. I’m confident that over time we can all become confident and skilled practitioners wielding this powerful (and fascinating) technology to our advantage.
Until tomorrow: Happy Coding!
Please feel free to reach out and say hi! I’d love to hear which machine learning projects you’re working on and/or which concepts you’re intrigued by.
Armchair Philosophy with Friends

Perhaps a fun bit of armchair philosophy: Is our inability to introspect on the operating machinery of the theater of the mind all that dissimilar from our inability to make sense of the agglomeration of weights and biases constituting an artificial neural network?
We can construct a crude and indubitably reductive picture along these lines: Our brains, constituted as they are of organic neural nets, have their weights and biases (and chemical states, and physical properties, and so on), and these aren’t accessible via introspection. Even knowing the physical properties (imagining they could be, at any given instant, written down and thereby known) — even knowing them doesn’t seem to do much good (at least as far as understanding “what’s on someone’s mind” — although, maybe not?). We have to see the brain in action producing the output of the mind and its attendant behavior to really make any sense of it (just like, of course, we have to see the neural network in action to really make sense of it).
And so (reductively and just for fun): what’s the difference between the black box of our brains and the black box of the artificial neural network? Take the case of a self-driving car. If it could provide a (self-consistent and cogent) reason for its action like we can, (e.g., “I braked because the vehicle in front of me braked”) would we need anything more (to evaluate it, understand it, feel comfortable with its operation, and so on)? If so, why? If not, why not?
(And, if so: in the case of a human driver: would we need more than the reason they produced? If so, why? If not, why not?)