# How to teach logic to your #NeuralNetworks?

Logic gates are the fundamental building blocks of electronics. They pretty much make up the complex architecture of all computing systems today. They perform addition, choice, negation and combinations thereof to form complex circuits. In this post, I shall walk you through a couple of basic logic gates to demonstrate how Neural Networks learn on their own without us having to manually code the if-then-else logic.

Let’s get started with the “Hello World” of Neural Networks, which is the XOR gate. An XOR gate is an exclusive OR gate with two inputs, A and B, and one output. It’s called an exclusive OR because the gate returns a TRUE state as an output if and ONLY if exactly one of the input states is true.

We know what the inputs are and what output is expected. Since the expected output is known, this becomes a “Supervised Learning” exercise for the Neural Network.

## Which Neural Network framework to choose?

There are several Neural Network frameworks out there. The most notable are the following:

TensorFlow: TensorFlow is an open-source library for numerical computation by Google. The framework is generic and flexible enough to model any type of data flow graph and computation. The current version has full-stack support for Python programmers (though there is a C++ API that is not as comprehensive).

Theano: Another Python-based open-source framework for large-scale machine learning and numerical computing. It focuses heavily on efficient large-scale multi-dimensional array computation, using NumPy and GPUs.

Keras: Keras is a library written on top of TensorFlow and Theano, which claims to be a minimalist and modular library built for fast experimentation.

Torch7: Torch is also a BSD-licensed open-source framework for machine learning and neural networks. The original authors claim to have developed Torch from the ground up with GPU hardware efficiency in mind. It is based on the LuaJIT scripting language.

Caffe: A deep-learning framework focusing on model architecture configurability and speed. In fact, they claim to have the fastest convnet implementation to date, at 1 ms/image for processing and 4 ms/image for learning. C++ is the preferred language.

DL4J: Deeplearning4J is a Java-based open-source framework which has recently been gaining popularity among the Java crowd due to its integrated support for Hadoop and Spark. Many existing enterprises are heavily invested in Java and Hadoop, and hence DL4J seems to be an easy transition for them. Like Red Hat, they also have a commercial organisation, Skymind, which trains, distributes and supports production-class deployments.

Here is a comparison sheet, for what it’s worth, that covers other frameworks as well: Deep Learning Framework Comparison

While I work on TensorFlow and have dabbled around on Caffe, I shall use **DL4J** for the posts for the following reasons:

- I am not a fan of Python (I am a Scala and Java programmer). Yes, I said it. No, I am not a believer in language flame wars. You should choose whatever language is your calling. There is nothing inherently wrong with Python, and there are a hell of a lot more Python programmers (in Data Science) out there than Scala programmers.
- But Scala is mathematically pure and beautiful for functional programming. In my world, Scala trumps Python in expressiveness, speed (compiled vs interpreted) and pure Functional-Programming.
- Scala compiles to Java bytecode and is tightly integrated with Java.
- Most enterprise-grade systems are Hadoop and Java shops.
- DL4J is the only framework out there that is doing a decent job on Java and Scala, and it has quite a hyper-active Gitter community.
- There is an even larger Java developer base than Python, as far as I have observed. Hence I shall choose Java… Nuff said. Let’s code…

Before you begin, you can set up DL4J by following the instructions here.

## XOR Gate Example

You can find the full code for XOR Gate Example here > **XORExample.java**

Let’s walk through the code now.

I have set up the truth table in the following section of code:

The **input** variable states that I have 4 records of 2 columns each.

The **output** variable (denoted as **labels**) states that I have 4 records of 1 column each.

In the line of code `input.putScalar(new int[] { 0, 0 }, 0);` I am stating that the value 0 should be loaded at position {0,0}. In other words, at row index 0 and column index 0, load the value 0.

Similarly, `input.putScalar(new int[] { 0, 1 }, 0);` loads the value 0 at row index 0 and column index 1.

This shall load the following truth table into a DataSet **ds**.
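As a rough illustration of the shape of that data, here is a plain-Java stand-in that fills a 4x2 input array and a 4x1 label array cell by cell. It deliberately avoids DL4J so it runs on a bare JDK; the class name, the method names and the row order are assumptions made for this sketch, not taken from XORExample.java.

```java
// Plain-Java stand-in for the XOR truth table (the post's own code uses DL4J arrays).
final class XorTruthTableData {

    // 4 records of 2 columns each: every combination of the inputs A and B.
    static double[][] buildInput() {
        double[][] input = new double[4][2];
        input[0][0] = 0; input[0][1] = 0; // row 0: A=0, B=0
        input[1][0] = 0; input[1][1] = 1; // row 1: A=0, B=1
        input[2][0] = 1; input[2][1] = 0; // row 2: A=1, B=0
        input[3][0] = 1; input[3][1] = 1; // row 3: A=1, B=1
        return input;
    }

    // 4 records of 1 column each: the expected XOR output for the matching input row.
    static double[][] buildLabels(double[][] input) {
        double[][] labels = new double[input.length][1];
        for (int row = 0; row < input.length; row++) {
            // For binary inputs, "exactly one is set" is the same as "the two differ".
            boolean exactlyOneSet = input[row][0] != input[row][1];
            labels[row][0] = exactlyOneSet ? 1 : 0;
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] input = buildInput();
        double[][] labels = buildLabels(input);
        System.out.println("A B | XOR");
        for (int row = 0; row < input.length; row++) {
            System.out.println((int) input[row][0] + " " + (int) input[row][1]
                    + " | " + (int) labels[row][0]);
        }
    }
}
```

In the real example the two arrays would be handed to the framework together as one data set, which is what **ds** stands for.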

I intend to use a network with 2 input neurons, 4 hidden neurons and 1 output neuron for learning the XOR gate. The architecture shall look as follows:

Here {I1, I2} are input neurons, {H1.. H4} are hidden neurons and O is a single output neuron which should learn to output either 0 or 1 based on the input values. B is a bias neuron whose value shall be zero.

The hidden neurons are set up to use a sigmoid activation function:

The output neuron shall have a hardtanh activation function as follows:

**hardtanh** has a harder threshold than tanh (which makes it a good fit for XOR), as follows: (Activation functions are explained > here)
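To make the two activation functions and the 2-4-1 wiring concrete, here is a minimal plain-Java sketch. It assumes the usual definitions (sigmoid as 1 / (1 + e^-x), hardtanh as a clip to the range -1 to 1); the class, the method names and the weights in `main` are hypothetical and only exercise the code, they are not trained values and not DL4J internals.

```java
// Plain-Java sketch of the activations and one forward pass through a 2-4-1 network.
final class TinyForwardPass {

    // Sigmoid squashes any real number into the open interval (0, 1).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // hardtanh is linear between -1 and 1 and clipped to -1 or 1 outside that range.
    static double hardTanh(double x) {
        return Math.max(-1.0, Math.min(1.0, x));
    }

    // hiddenWeights[i][h] connects input i to hidden neuron h; outputWeights[h] connects
    // hidden neuron h to the single output neuron. Biases are added to each weighted sum.
    static double forward(double[] in, double[][] hiddenWeights, double[] hiddenBias,
                          double[] outputWeights, double outputBias) {
        double outputSum = outputBias;
        for (int h = 0; h < outputWeights.length; h++) {
            double hiddenSum = hiddenBias[h];
            for (int i = 0; i < in.length; i++) {
                hiddenSum += in[i] * hiddenWeights[i][h];
            }
            outputSum += outputWeights[h] * sigmoid(hiddenSum);
        }
        return hardTanh(outputSum);
    }

    public static void main(String[] args) {
        // Hypothetical, untrained weights chosen only to exercise the code.
        double[][] hiddenWeights = { {0.1, 0.2, 0.3, 0.4}, {0.5, 0.6, 0.7, 0.8} };
        double[] hiddenBias = {0, 0, 0, 0};
        double[] outputWeights = {0.25, 0.25, 0.25, 0.25};
        double out = forward(new double[] {0, 1}, hiddenWeights, hiddenBias, outputWeights, 0);
        System.out.println("untrained output for (0, 1): " + out);
    }
}
```

Training is then the search for weights that make `forward` return values close to the labels for all four input rows.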

The following set of code configures your Neural Network architecture.

You set up the layers and activation function as illustrated above. Key things to note from above are as follows:

- We have used a random SEED which is 100 in our case.
- We have stated that we need 200 ITERATIONS of training on the same truth table. (This value was found through trial and error. Optimizing a neural network is explained > here.)
- We have set ‘epsilon’, the learning rate, to 0.7.
- The backprop optimization function used is Stochastic Gradient Descent. (Backpropagation is explained > here)
- The error function used is the Negative Log Likelihood.
- We have set the weight initialization to be a uniform distribution between 0 and 1.
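To see how the seed, the learning rate and the iteration count interact, here is a toy gradient descent on a single weight. It is a sketch under loud assumptions: the loss (w - 3)^2 is made up for illustration, the update is plain (not stochastic) gradient descent, and none of this is DL4J's optimizer. Only the values 100, 0.7 and 200 come from the list above.

```java
// Toy gradient descent on one weight; NOT the DL4J optimizer.
final class ToyGradientDescent {

    // Minimises the hypothetical loss (w - 3)^2, whose gradient is 2 * (w - 3).
    static double train(long seed, double learningRate, int iterations) {
        // Initial weight drawn uniformly from [0, 1); the seed makes the draw reproducible.
        double w = new java.util.Random(seed).nextDouble();
        for (int i = 0; i < iterations; i++) {
            double gradient = 2.0 * (w - 3.0);
            w -= learningRate * gradient; // step against the gradient
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println("weight before training: " + train(100L, 0.7, 0));
        System.out.println("weight after 200 steps: " + train(100L, 0.7, 200));
    }
}
```

With this particular loss each step multiplies the distance to the optimum by -0.4, so the weight settles at 3. A learning rate above 1.0 would make the same loop diverge, which is why such values are usually found by trial and error.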

## Negative Log Likelihood

To understand Likelihood, let’s understand probability first.

Probability is used to describe a future outcome given a fixed set of parameters. In other words, given a coin and the outcome “head”, what is the probability of that outcome occurring? It is 1 out of 2 possible outcomes. (When the coin is actually flipped, a “tail” may well be observed, though.)

Likelihood is used after the data is found (the coin has already been flipped and an outcome is visible). In other words, likelihood is a function of the parameters for a given outcome. If a coin was already flipped and the observed outcome is, let’s say, “tail”, then we ask what underlying statistical process or function caused this “tail”. In short, we ask how likely it is that a “tail” occurred among all other possibilities. Likelihood can be defined as follows:
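In standard notation (assumed here, with $\theta$ for the parameters and $x$ for the observed outcome), the likelihood of the parameters is the probability the model assigns to what was actually observed:

$$
\mathcal{L}(\theta \mid x) = P(x \mid \theta)
$$

For the coin, if $\theta$ is the probability of a head and a tail was observed, then $\mathcal{L}(\theta \mid \text{tail}) = 1 - \theta$.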

Log-Likelihood: as Wikipedia states, for many applications the natural logarithm of the likelihood function, called the **log-likelihood**, is more convenient to work with. Because the logarithm is a **monotonically increasing** function, the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood.

The key is that the logarithm is monotonically increasing, so the likelihood and the log-likelihood achieve their maximum value at the same points.

To find a best-fit weight vector when we are using **negative** log-likelihood as the cost function, we are trying to **minimize the error**, which is the same as **maximizing the log-probability density of the expected output**. (Note that, colloquially, likelihood is pretty much the opposite of probability.)
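The coin example can be turned into a small numeric check. The sketch below assumes a coin whose probability of a head is `theta` and independent flips; the class and method names are made up for illustration. It computes the negative log-likelihood of an observed set of flips, so the `theta` with the smallest value is the one that makes the observations most likely.

```java
// Negative log-likelihood of coin flips under an assumed head probability theta.
final class CoinNegativeLogLikelihood {

    // theta must lie strictly between 0 and 1, otherwise the logarithm is undefined.
    static double nll(double theta, int heads, int tails) {
        double logLikelihood = heads * Math.log(theta) + tails * Math.log(1.0 - theta);
        return -logLikelihood;
    }

    public static void main(String[] args) {
        // Hypothetical observation: 1 head and 2 tails.
        double[] candidates = {0.2, 1.0 / 3.0, 0.5, 0.8};
        for (double theta : candidates) {
            System.out.println("theta = " + theta + " -> NLL = " + nll(theta, 1, 2));
        }
    }
}
```

For 1 head and 2 tails the smallest value among these candidates is at `theta = 1/3`, the observed fraction of heads: minimizing the negative log-likelihood picks the parameter that best explains the data.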

Negative log-likelihood seems to work better for a large number of iterations, and also when you are squashing multiple values to be represented by the same output or when you use squashing functions like softmax as the output activation function.

The following line of code initializes the Network

In the following lines of code, we check the output of the untrained network, run the actual training, and then check the output again after training.

`net.fit(ds);` is the line which trains the Neural Network model.

The output of this code should look as follows:

This states that there are 12 free parameters in layer zero. This is because we have 2 input neurons and 1 bias neuron connecting into 4 hidden neurons. That is 3 * 4.

Also, we have 4 hidden neurons + 1 additional bias neuron in the hidden layer connecting to 1 output neuron. So layer one shows 5 free parameters.

Free parameters are nothing but connections for which weights need to be found. Here, we have to find the correct weights for 17 free parameters.
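The arithmetic above follows one rule: a fully connected layer needs one weight per input-to-output connection plus one bias weight per output neuron. A minimal sketch (class and method names are assumptions for illustration):

```java
// Counts the free parameters of a fully connected network, layer by layer.
final class FreeParameterCount {

    // (inputs + 1 bias) connections into each of the output neurons.
    static int denseLayerParams(int inputs, int outputs) {
        return (inputs + 1) * outputs;
    }

    public static void main(String[] args) {
        int layerZero = denseLayerParams(2, 4); // 2 inputs + bias into 4 hidden neurons
        int layerOne = denseLayerParams(4, 1);  // 4 hidden + bias into 1 output neuron
        System.out.println("layer 0: " + layerZero);
        System.out.println("layer 1: " + layerOne);
        System.out.println("total:   " + (layerZero + layerOne));
    }
}
```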

We can see that the output values of the truth table before training are displayed as [0.64, 0.82, 0.85, 0.99]. In other words, they can be interpreted as follows: (clearly this is random and the Neural Network is just in an initial state with no learning)

Since we are using a ScoreIterationListener, which outputs the scores to the console, we can see the scores in batches of 100 (we set this value) once the training starts.

At the end of the training we notice another set of output [0.02, 1.00, 1.00, -0.02], which can be interpreted as follows:

Notice that the output neuron has **converged to the expected values** within just 200 iterations of training, for a network with as few as 17 free parameters.

The values are close enough (if not precise) to represent the logic gate. In real world electronics, the outputs are not exactly a zero or a one. There are quiescent points and zero-tolerance ratings to denote an ON or an OFF state of the logic gate.
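One simple way to read such analog outputs as logic levels is to compare them against a threshold. The sketch below uses 0.5 as the cut-off, which is an assumption chosen for illustration, and applies it to the two output vectors quoted in this post; the class and method names are hypothetical.

```java
// Maps analog network outputs to logic levels using an assumed threshold of 0.5.
final class LogicLevels {

    static int toLogicLevel(double output, double threshold) {
        return output >= threshold ? 1 : 0;
    }

    static int[] toLogicLevels(double[] outputs, double threshold) {
        int[] levels = new int[outputs.length];
        for (int i = 0; i < outputs.length; i++) {
            levels[i] = toLogicLevel(outputs[i], threshold);
        }
        return levels;
    }

    public static void main(String[] args) {
        double[] beforeTraining = {0.64, 0.82, 0.85, 0.99};
        double[] afterTraining = {0.02, 1.00, 1.00, -0.02};
        System.out.println("before: " + java.util.Arrays.toString(toLogicLevels(beforeTraining, 0.5)));
        System.out.println("after:  " + java.util.Arrays.toString(toLogicLevels(afterTraining, 0.5)));
    }
}
```

Under that threshold the untrained outputs all read as ON, while the trained outputs read as the XOR column 0, 1, 1, 0.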

This is the power of a Neural Network: we can now just send different truth tables to train the exact same network architecture, and it can pretty much learn different logic. Of course, you need to tune the parameters through trial and error to ensure the network is learning correctly.
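Swapping the truth table only means swapping the label column, since the four input rows stay the same. As a reference, here is a sketch of the label columns for the common two-input gates, assuming the rows are ordered (0,0), (0,1), (1,0), (1,1); the class and method names are hypothetical.

```java
// Label columns for two-input gates; rows are (0,0), (0,1), (1,0), (1,1) in that order.
final class GateLabels {

    static int[] labelsFor(String gate) {
        switch (gate) {
            case "AND":  return new int[] {0, 0, 0, 1};
            case "OR":   return new int[] {0, 1, 1, 1};
            case "NAND": return new int[] {1, 1, 1, 0};
            case "NOR":  return new int[] {1, 0, 0, 0};
            case "XOR":  return new int[] {0, 1, 1, 0};
            case "XNOR": return new int[] {1, 0, 0, 1};
            default: throw new IllegalArgumentException("unknown gate: " + gate);
        }
    }

    public static void main(String[] args) {
        String[] gates = {"AND", "OR", "NAND", "NOR", "XOR", "XNOR"};
        for (String gate : gates) {
            System.out.println(gate + " -> " + java.util.Arrays.toString(labelsFor(gate)));
        }
    }
}
```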

Here is the code for AND Gate > ANDExample.java

Notice that I changed the output activation function to sigmoid and the number of iterations to 350 for the AND gate to learn correctly. The output of the AND gate looks as follows:

Which can be interpreted as follows:

Can you now code the behavior of the rest of the logic gates and post the link to your code as comments here?

Also, shoot me questions if any of the concepts were unclear.