# What is a Neural Network?

One of the most powerful and widely used artificial-intelligence approaches is the **neural network**. But what exactly are neural networks, and how do they work? Let me explain in plain English.

# What is a Neural Network?

A neural network is a collection of connected nodes called **neurons**.

# What is a Neuron?

A neuron is a node that has one or more **inputs** and a single **output**, as shown in Figure 1. Three actions take place inside a neuron:

- A weight is associated with each **input** to amplify or de-amplify it;
- All weighted inputs are summed;
- The sum is used as an input for an **activation function**, which determines the final output.
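The three actions above can be sketched in a few lines of Java. This is only an illustration, not the code from the repository mentioned later: the class name `NeuronSketch`, the method names, and the sample weights are assumptions made up for this example, and the activation used is the sigmoid function described in the next section.

```java
// Hypothetical sketch of a single neuron; names and values are illustrative only.
class NeuronSketch {

    // Activation function: the sigmoid squashes any real number into (0, 1).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Weighs each input, sums the weighted inputs, and applies the activation.
    static double output(double[] inputs, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];   // actions 1 and 2: weigh and sum
        }
        return sigmoid(sum);                 // action 3: activation
    }

    public static void main(String[] args) {
        double[] inputs = {1.0, 0.0};
        double[] weights = {0.5, -0.3};      // made-up weights
        System.out.println(output(inputs, weights));
    }
}
```

Because the second input is zero, its weight has no effect here; the output depends only on the first input and its weight.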

# What is an Activation Function?

An activation function is a mathematical function that outputs a small value for small inputs and a larger value once its inputs exceed a **threshold**. A commonly used example is the **sigmoid function**, shown in Figure 2.

The idea is quite simple: around zero, a small change in the input causes a significant change in the output, while for inputs that are very large (positive or negative) the output barely changes.
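In symbols, the standard definition of the sigmoid function and its derivative is:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
$$

At $x = 0$ the output is $\sigma(0) = 0.5$ and the slope reaches its maximum, $\sigma'(0) = 0.25$. As $x$ moves far from zero in either direction, the slope approaches zero, which is why the output hardly moves there.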

# How Are Neurons Connected?

The output of one neuron can be used as an input to other neurons. Typically, neurons are aggregated into **layers**. A layer is a general term that applies to a collection of nodes operating together at a specific depth within the neural network. Outputs travel from the first layer to the last layer. As shown in Figure 3, there is typically an **input layer**, one or more middle layers (called **hidden layers**), and an **output layer**.

- The input layer contains only data. There are no working neurons there.
- The hidden layer(s) is/are where the
- The output layer contains neurons that calculate the final output.

The number of input and output neurons is dependent upon the problem at hand. The number of hidden neurons is often the sum of input and output neurons, but it is not a rule.

# How Do They Work?

Neural networks help us to **classify information**. They are trained (they learn) by processing examples, each containing a known *input* and *output*. The goal of the **training process** is to calculate values for the weights associated with each input in each neuron. Once we have trained the neural network, i.e., once we have calculated values for all the weights, we can use it to map new, unseen inputs to an output.

# Example

The Hello-World example for neural networks is usually a network that learns the XOR operator. Our neural network for this has

- two inputs,
- one output, and
- one hidden layer with three neurons (the sum of input and output neurons, as suggested above).

Figure 4 shows our neural network, together with the input data that we will use to train it and the known outputs.

## Step 1. Initialize Weights and Bias

The first step with a neural network is to initialize weights. What options do we have?

- Initialize with zeros only: it would be a poor strategy 😳. Remember, the weights will be multiplied by the inputs, so with weights equal to zero, the inputs no longer play a role, and the neural network cannot learn properly.
- **Initialize weights randomly**: it is a bit naïve, but it works nicely very often, except in a few cases. Let’s use this approach for our example.
- Advanced strategies are available.

Thus, we are going to initialize the nine weight values in our neural network with random values.

## Step 2. Forward Propagation

It is a fancy name for providing the network with one input and observing the output. We start at the input layer and calculate the outputs for the hidden layer. The results are passed forward to the next layer. Then, we calculate the output in the output layer using the outputs from the hidden layer as inputs. Figure 5 shows the maths. It is just linear algebra. That’s it.
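In symbols, forward propagation for our 2-3-1 network can be sketched as follows. The notation is an assumption chosen for this sketch and may differ from the labels in Figure 5: $x_i$ are the inputs, $h_j$ the outputs of the hidden neurons, $o$ the output of the network, $w_{ij}$ and $v_j$ the weights, $b_j$ and $b_o$ the bias values, and $\sigma$ the sigmoid function.

$$
h_j = \sigma\Bigl(\sum_{i=1}^{2} w_{ij}\,x_i + b_j\Bigr), \quad j = 1, 2, 3
\qquad
o = \sigma\Bigl(\sum_{j=1}^{3} v_j\,h_j + b_o\Bigr)
$$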

## Step 3. Calculate the Error

The **error** is calculated as the difference between the known output and the calculated output (output₃ in our example). Error values are commonly squared to remove negative signs and to give more weight to larger differences. Dividing by two does not change which weights minimize the error, and it will be helpful later because it makes the derivative simpler.

If the neural network has more than one node in the output layer, the error is calculated as the sum of all the partial errors.
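Written out, with $t_k$ the known output and $o_k$ the calculated output of output node $k$ (symbols assumed for this sketch):

$$
E = \sum_{k} \frac{1}{2}\,(t_k - o_k)^2
$$

For example, a known output of 1 and a calculated output of 0.6 give $E = \frac{1}{2}(1 - 0.6)^2 = 0.08$.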

## Step 4. Backward Propagation

Since we are using random values for the weights, our output will probably have a high error. We need to reduce the error. The only way to reduce the error is to change the calculated value. And the only way to change the calculated value is by **modifying the values of the weights**. A proper adjustment of the weights ensures that the subsequent output will be closer to the expected output. This process is repeated until we are satisfied that the network produces results close enough to the known output.

How do we modify the values of the weights so that the error is reduced?

Short answer: use the **gradient descent algorithm**. It was first suggested in 1847. It applies multivariable calculus, specifically **partial derivatives**. The **derivative of the error function with respect to each weight** tells us how to adjust that weight. The derivative is multiplied by a selected number (called the **learning rate**) that controls the size of each adjustment, so that the updated weight moves toward a minimum of the error function without overshooting it. The learning rate is a small positive value, often in the range between 0.0 and 1.0.
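To get a feeling for the update rule before applying it to a network, here is a minimal sketch of gradient descent on a single weight. The error function $E(w) = (w - 3)^2$, the class name, and the learning rate of 0.1 are all made up for this illustration; they are not part of the XOR example.

```java
// Hypothetical sketch: gradient descent on a made-up one-weight error function.
class GradientDescentSketch {

    // Made-up error function E(w) = (w - 3)^2, smallest at w = 3.
    static double error(double w) {
        return (w - 3.0) * (w - 3.0);
    }

    // Its derivative dE/dw = 2 * (w - 3).
    static double derivative(double w) {
        return 2.0 * (w - 3.0);
    }

    // Repeats the update w = w - learningRate * dE/dw for a number of steps.
    static double minimize(double start, double learningRate, int steps) {
        double w = start;
        for (int s = 0; s < steps; s++) {
            w -= learningRate * derivative(w);
        }
        return w;
    }

    public static void main(String[] args) {
        double w = minimize(0.0, 0.1, 50);
        System.out.println("w = " + w + ", error = " + error(w));
    }
}
```

With a learning rate of 0.1, each step shrinks the distance to the minimum by a factor of 0.8. For this particular function, a learning rate above 1.0 would make the weight move away from the minimum instead, which is why the learning rate is kept small.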

To calculate the partial derivatives with respect to the weights, we need the **derivative of the error function** and the **derivative of the sigmoid function**. Figure 7 shows the general equation for the weight update and one example that solves the equation for the weight W₆, the weight of the first input for the neuron in the output layer.

The chain rule from calculus is applied to compute the derivative of the composite function. Be aware that the calculations are similar but not the same for neurons in the output layer and neurons in the hidden layer.
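As a sketch of where the chain rule leads, assume a single output $o = \sigma(\text{net}_o)$ with $\text{net}_o = \sum_j v_j h_j + b_o$, hidden outputs $h_j = \sigma\bigl(\sum_i w_{ij} x_i + b_j\bigr)$, known output $t$, error $E = \frac{1}{2}(t - o)^2$, and learning rate $\eta$. These symbols are assumptions for this sketch and may not match the labels in Figure 7. For a weight $v_j$ in the output layer:

$$
\frac{\partial E}{\partial v_j}
= \frac{\partial E}{\partial o}\cdot\frac{\partial o}{\partial \text{net}_o}\cdot\frac{\partial \text{net}_o}{\partial v_j}
= -(t - o)\cdot o\,(1 - o)\cdot h_j
$$

For a weight $w_{ij}$ in the hidden layer, the chain continues through the hidden neuron:

$$
\frac{\partial E}{\partial w_{ij}}
= -(t - o)\cdot o\,(1 - o)\cdot v_j\cdot h_j\,(1 - h_j)\cdot x_i
$$

Each weight is then updated with $w \leftarrow w - \eta\,\frac{\partial E}{\partial w}$.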

So, we start with random weight values, then:

- We calculate outputs for all neurons using the math in Figure 5 (forward propagation) and the difference between the calculated output and the known output (error).
- If the difference is greater than what we accept, we calculate new weight values (backward propagation).

These two activities repeat until we reduce the error to an acceptable value. An *acceptable* error could be anywhere between 0 and 0.05.

# Coding the Example

Let us see how the four steps described above look in code. We are going to implement a simple neural network in Java. I do not want to reinvent the wheel, just to show the nuts and bolts so that we understand how things work.

First, the attributes:

- A constant value to define the learning rate that we will be using.
- Three variables to store the total number of nodes in each layer. We will create a neural network with two nodes in the input layer, three in a hidden layer, and one in the output layer.
- Three arrays to store weight values, bias values, and the output of each neuron.

We will create a neural network with six nodes, and we will need nine weights and four bias values for the hidden and output layer nodes.

## Step 1. Initialize Weights and Bias

We can use a constructor to initialize the arrays and put initial values in weights and bias. Remember that, initially, they are just random values. Lines 11 and 13 do the initialization.
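The attributes and the constructor described above might look like the following. This is a sketch under assumptions, not the listing from the repository, so its line numbers will not match the ones mentioned above: the class name `NetworkInit`, the field names, the learning rate of 0.5, the seed parameter, and the choice of uniform random values in [-1, 1) are all made up for this illustration.

```java
import java.util.Random;

// Hypothetical sketch: class and field names are assumptions, not the repository code.
class NetworkInit {
    static final double LEARNING_RATE = 0.5;   // assumed value, not used in this fragment

    final int inputNodes;
    final int hiddenNodes;
    final int outputNodes;

    final double[] weights;   // one weight per connection between layers
    final double[] bias;      // one bias per hidden and output node
    final double[] outputs;   // one output value per node (input nodes included)

    NetworkInit(int inputNodes, int hiddenNodes, int outputNodes, long seed) {
        this.inputNodes = inputNodes;
        this.hiddenNodes = hiddenNodes;
        this.outputNodes = outputNodes;

        weights = new double[inputNodes * hiddenNodes + hiddenNodes * outputNodes];
        bias = new double[hiddenNodes + outputNodes];
        outputs = new double[inputNodes + hiddenNodes + outputNodes];

        // Random starting values, uniform in [-1, 1)
        Random random = new Random(seed);
        for (int i = 0; i < weights.length; i++) {
            weights[i] = random.nextDouble() * 2.0 - 1.0;
        }
        for (int i = 0; i < bias.length; i++) {
            bias[i] = random.nextDouble() * 2.0 - 1.0;
        }
    }

    public static void main(String[] args) {
        NetworkInit net = new NetworkInit(2, 3, 1, 42L);
        // prints "9 weights, 4 bias values"
        System.out.println(net.weights.length + " weights, " + net.bias.length + " bias values");
    }
}
```

Passing a seed is a design choice for this sketch only: it makes runs repeatable, which helps when comparing results while learning how the network behaves.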

## Step 2. Forward Propagation

We need to solve the equations shown in Figure 5. Thus, let us create a method for that. Notice that the inputs are handled as nodes (in the input layer), but we do not calculate output values for them. We calculate outputs for the nodes in the hidden layer and the nodes in the output layer. The output is calculated by multiplying each input value by its weight, summing the products, and applying the activation function. We use sigmoid as the activation function, and we create a sigmoid method just to keep the separation of concerns. Nothing complex here, basically an implementation of the linear algebra described in Figure 5. We will run this for every single set of input values; thus, it will run 4 times, with {0,0}, {0,1}, {1,0}, and {1,1}.
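A forward-propagation method along these lines could be sketched as follows. The class name `ForwardSketch`, the `layer` helper, the two-dimensional weight layout, and the sample weights are assumptions for this illustration; the repository code may organize its arrays differently.

```java
// Hypothetical sketch of forward propagation for a network with one hidden layer.
class ForwardSketch {

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Computes one layer: out[j] = sigmoid(b[j] + sum over i of in[i] * w[i][j]).
    static double[] layer(double[] in, double[][] w, double[] b) {
        double[] out = new double[b.length];
        for (int j = 0; j < b.length; j++) {
            double sum = b[j];
            for (int i = 0; i < in.length; i++) {
                sum += in[i] * w[i][j];
            }
            out[j] = sigmoid(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up weights for a 2-3-1 network; a real run would start from random values.
        double[][] wHidden = {{0.1, -0.2, 0.3}, {0.4, 0.5, -0.6}};
        double[] bHidden = {0.0, 0.0, 0.0};
        double[][] wOutput = {{0.7}, {-0.8}, {0.9}};
        double[] bOutput = {0.0};

        double[][] xorInputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        for (double[] x : xorInputs) {
            double[] hidden = layer(x, wHidden, bHidden);       // input -> hidden
            double[] output = layer(hidden, wOutput, bOutput);  // hidden -> output
            System.out.println(x[0] + " XOR " + x[1] + " -> " + output[0]);
        }
    }
}
```

With untrained weights like these, the four outputs are not yet meaningful; they only show that values flow from the input layer to the output layer.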

## Step 3. Calculate the Error

In our example, with only one neuron in the output layer, the error calculation is pretty straightforward. But, let us generalize the idea in our code by creating an implementation that can be used with one or more neurons in the output layer. This implementation is shown in Figure 11.
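A generalized error calculation of this kind might look like the sketch below. The class and method names are assumptions for this illustration; the method simply adds half the squared difference for every output neuron.

```java
// Hypothetical sketch of the error calculation for one or more output neurons.
class ErrorSketch {

    // Sums half the squared difference between known and calculated outputs.
    static double error(double[] known, double[] calculated) {
        double sum = 0.0;
        for (int k = 0; k < known.length; k++) {
            double diff = known[k] - calculated[k];
            sum += 0.5 * diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        // One output neuron: known output 1.0, calculated output 0.6 (made-up values)
        System.out.println(error(new double[]{1.0}, new double[]{0.6}));
    }
}
```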

## Step 4. Backward Propagation

Finally, let us create the learning part: a method that implements the math responsible for updating the values of the weights. The multivariable calculus lives there. This method is run for every single set of known output values; therefore, it will run 4 times, with {0.0}, {1.0}, {1.0}, and {0.0}.

We have all the parts; it is time to put them together and run our implementation. Take a look at the *main()* method for our class, as a summary:

- Training data (input and known output) are represented in two arrays.
- A neural network object is created with two inputs, three nodes in a hidden layer, and one node in the output layer.
- Forward propagation, error calculation, and backward propagation are run 10,000 times.
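The backward-propagation step and the training loop summarized above can be sketched together as one small program. It is a minimal illustration under assumptions, not the repository code: the class name, the helper methods, the two-dimensional weight layout, the learning rate of 0.5, and the seed are all made up for this example.

```java
import java.util.Random;

// Hypothetical sketch: a 2-3-1 network trained on XOR with sigmoid activations.
class XorTrainingSketch {
    static final double LEARNING_RATE = 0.5;   // assumed value

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Forward pass for one layer: out[j] = sigmoid(b[j] + sum over i of in[i] * w[i][j]).
    static double[] layer(double[] in, double[][] w, double[] b) {
        double[] out = new double[b.length];
        for (int j = 0; j < b.length; j++) {
            double sum = b[j];
            for (int i = 0; i < in.length; i++) {
                sum += in[i] * w[i][j];
            }
            out[j] = sigmoid(sum);
        }
        return out;
    }

    // One forward and backward pass on a single example. Updates the weights
    // in place and returns the error measured before the update.
    static double trainStep(double[] x, double target, double[][] wH, double[] bH,
                            double[][] wO, double[] bO) {
        double[] h = layer(x, wH, bH);
        double o = layer(h, wO, bO)[0];
        double error = 0.5 * (target - o) * (target - o);

        // Output neuron: dE/dnet = (o - target) * o * (1 - o)
        double deltaO = (o - target) * o * (1.0 - o);
        for (int j = 0; j < h.length; j++) {
            // Hidden neuron j: chain rule through its not-yet-updated output weight
            double deltaH = deltaO * wO[j][0] * h[j] * (1.0 - h[j]);
            wO[j][0] -= LEARNING_RATE * deltaO * h[j];
            bH[j] -= LEARNING_RATE * deltaH;
            for (int i = 0; i < x.length; i++) {
                wH[i][j] -= LEARNING_RATE * deltaH * x[i];
            }
        }
        bO[0] -= LEARNING_RATE * deltaO;
        return error;
    }

    // Fills a matrix with uniform random values in [-1, 1).
    static double[][] randomMatrix(int rows, int cols, Random random) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int c = 0; c < cols; c++) {
                row[c] = random.nextDouble() * 2.0 - 1.0;
            }
        }
        return m;
    }

    // Trains on the four XOR examples and returns the mean error of every iteration.
    static double[] train(int iterations, long seed) {
        double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] known = {0, 1, 1, 0};
        Random random = new Random(seed);
        double[][] wH = randomMatrix(2, 3, random);
        double[][] wO = randomMatrix(3, 1, random);
        double[] bH = randomMatrix(1, 3, random)[0];
        double[] bO = randomMatrix(1, 1, random)[0];

        double[] history = new double[iterations];
        for (int it = 0; it < iterations; it++) {
            double total = 0.0;
            for (int s = 0; s < inputs.length; s++) {
                total += trainStep(inputs[s], known[s], wH, bH, wO, bO);
            }
            history[it] = total / inputs.length;
        }
        return history;
    }

    public static void main(String[] args) {
        double[] history = train(10_000, 42L);
        System.out.println("first iteration error: " + history[0]);
        System.out.println("last iteration error:  " + history[history.length - 1]);
    }
}
```

How low the error gets depends on the random starting weights. With some seeds a small network like this one may need more iterations, or may stall at a higher error, so it is worth trying a few seeds.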

Finally, let us try our neural network. After 10,000 iterations, our neural network is alive and working with *acceptable* performance. Figure 14 shows how the error rate decreases. The X-axis represents the iteration number (0 to 10,000), and the Y-axis is the mean square error as calculated in lines 18 and 23 of the *main()* method shown in Figure 13.

Not bad for ~100 lines of code (you can download the complete source code from my GitHub repository). However, we could have done the same with ~10 lines of code using a library. One such library is **Eclipse Deeplearning4j**, an open-source, distributed deep-learning library written for Java. With a library, we can solve more complex problems, such as training a neural network for image classification. The number of inputs will increase, the training data set will be much larger (than our four lines for XOR), and we would need more than one hidden layer. But that is another story. Thanks for reading. Feel free to leave your feedback and reviews below.

# References

Do you want to learn more about the details? Review here the derivative of the sigmoid function; review here the chain rule in calculus; review here the gradient descent definition; and here a detailed description of the maths behind backward propagation.