Designing Neural Nets for the Human Intuition Challenge

A visual guide to improving simple TensorFlow models

Published in

HackerNoon.com

15 min read · Jan 17, 2017

This article describes the technical details around data collection, analysis, and neural network development for an experiment summarized in this article: “Are you Intuitive? Challenge my Machine!”

This content covers a lot of experimentation, code, and visualizations, which makes it a good fit for machine learning newcomers.

If you are a TensorFlow beginner or just interested in this unique data set, check out the link to the code on Github at the bottom of the article.

The Experiment

The Human Intuition Challenge emerged while exploring Google’s TensorFlow framework for machine learning. When exploring a new technology, it helps to start with simple examples and build intuition slowly. Visualizing and troubleshooting neural networks is notoriously hard, so starting with simpler examples also helps build a solid foundation.

“The Eureka Factor,” by Dr. John Kounios and Dr. Mark Beeman, cites a study by Arthur Reber, who ran experiments on human intuition. If Arthur Reber’s study could be recreated, it would yield a simple and quite interesting data set for training a neural network. It could also be used to compare intuition between humans and artificial neural networks.

In Arthur Reber’s study, participants looked at strings of letters that were generated by two different rule sets. The participants were asked to simply assign each string to one of two “families”. Here is one example of a string presented to participants:

RNIWKQ

Over time, the participants learned to classify these strings better than chance without actually “knowing” the rule sets used to generate them.

Recreating this experiment for humans is pretty straightforward and the data is usable for machine learning.

The Web Application

Flexibility and efficiency were important for this experiment. Data needed to be generated for both humans and machines, presented in different forms and in different quantities. Building a web application was a natural way to satisfy these needs.

The web application was a NodeJS application with an AngularJS front-end. It ran on a local computer to generate training data for the neural network, and it was deployed to an Amazon EC2 instance to collect data from humans.

gitsome/HumanIntuitionChallenge

HumanIntuitionChallenge - A project to compare human intuition against neural networks with TensorFlow

github.com

The web application’s first role was to manage “schemes” (rule sets) that would be used to generate random strings like those from Arthur Reber’s study. The UI had screens to create, delete, and edit those “schemes.”

For this experiment, two schemes were generated, “A” and “B”. Each scheme was a list of transformations that were applied sequentially to a randomly generated string of 7 characters.

If you want to take the Human Intuition Test before you see what transformations were used in the experiment, then you can do so now:

Take the Human Intuition Test!

Okay, here are the transformations for Scheme A and Scheme B:

Each transformation script uses a global “cursor” class instance to navigate around the string to modify letters using some helper methods. As an example, you can see the last transformation in Scheme B is:

cursor.moveTo(4).set(cursor.getRandomCon());

This transformation moves the cursor to the 5th letter (because the first letter is in the ZERO position), and sets it to a random consonant.
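To make the chained cursor call easier to picture, here is a minimal Python sketch of the same idea. It is not the JavaScript class from the web application: the `Cursor` class, its method names, and the choice to treat “Y” as a consonant are all assumptions made for illustration.

```python
import random
import string

VOWELS = "AEIOU"  # assumption: "Y" counts as a consonant
CONSONANTS = "".join(ch for ch in string.ascii_uppercase if ch not in VOWELS)


class Cursor:
    """Hypothetical Python stand-in for the article's JavaScript cursor."""

    def __init__(self, text, rng=None):
        self.letters = list(text)
        self.position = 0
        self.rng = rng or random.Random()

    def move_to(self, index):
        # Positions are zero-based, so index 4 is the 5th letter.
        self.position = index
        return self  # returning self is what allows chained calls

    def set(self, letter):
        # Overwrite the letter under the cursor.
        self.letters[self.position] = letter
        return self

    def get_random_con(self):
        return self.rng.choice(CONSONANTS)

    def text(self):
        return "".join(self.letters)


def random_string(rng, length=7):
    """Random starting string of uppercase letters."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


rng = random.Random(0)
cursor = Cursor(random_string(rng), rng)
# Python analogue of: cursor.moveTo(4).set(cursor.getRandomCon());
cursor.move_to(4).set(cursor.get_random_con())
print(cursor.text())
```

Whatever the starting string was, the 5th letter of the printed result is a consonant, which is the kind of regularity a learner can pick up on.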

Gathering the Data From the Humans

Below is a screenshot of what the human test looked like within the web application. This delivery mechanism was how the web application could record human accuracy. Getting real people to take the online test was the next hurdle.

The Human Intuition Online Test : Wrong Answer!

Duncan Watts’s book, “Everything is Obvious : *Once you know the answer”, offered a way to gather human data. In the book, Watts describes some non-obvious results from using human workers in the Mechanical Turk ecosystem. Mechanical Turk is an online service run by Amazon to organize and deploy “human intelligence tasks” that real humans complete.

After a “human intelligence task” was set up on Mechanical Turk, 73 people signed up for and completed the task online in less than 5 hours.

( Read about experiences using Mechanical Turk in this article : “How to Attract “Turkers” and be a Mechanical Turk Hero!” )

Gathering the Data For the Machines

Getting the labelled data for training the neural network was the easy part. The web application used the configured schemes to generate random strings with their corresponding labels “A” or “B”.

Developing the Neural Network

After the analysis of the human participants was completed, a neural network needed to be set up for comparison.

View the human results here: “Are you Intuitive? Challenge my Machine!”

How does the process of creating a neural network start?

The data set ultimately determines the structure of the optimal network. For instance, image recognition on the ImageNet data set requires a deep network with many convolution and pooling layers in order to extract increasingly complex patterns. It is also often recommended to keep hidden layers as small as possible to avoid overfitting.

The data set for this experiment isn’t as complex as a 2-dimensional color image, so it was reasonable to start with one hidden layer and then move on to more complex networks and more layers as needed.

What Should the Inputs Look Like?

Feeding letters straight into the neural network is not ideal. This is because it is important to make sure every important feature is represented and can be manipulated mathematically.

A simple way to transform each 7-letter string into a numerical input would be to map each letter to a number from 1 to 26, so that “A” would be 1 and “Z” would be 26. Mathematically, though, “Z” may then be treated as more important because of its larger magnitude, which could lead to undesired results during training.

There is another option that still portrays relative position between letters but does not weight letters differently. This option is to transform each string into its one hot encoding representation.

The one hot encoding format will transform each letter into a 26 dimension array that will be ZERO for all entries in the array except for a single ONE in the position for the letter that is “ON”.

For example, take the first letter “W” in “WIYLEWS”. The “W” is “ON” and the rest of the possible letters are “OFF”. In a one hot encoding representation, this is a vector with a ONE in the 23rd position (the position of “W” in the alphabet) and a ZERO everywhere else. As a single dimensional matrix/vector that looks like this:

“W”: [0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,1,0,0, 0]

Repeating the process for the entire string gives:

“W”: [0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,1,0,0, 0]
“I”: [0,0,0,0,0, 0,0,0,1,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0]
“Y”: [0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,1, 0]
“L”: [0,0,0,0,0, 0,0,0,0,0, 0,1,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0]
“E”: [0,0,0,0,1, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0]
“W”: [0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,1,0,0, 0]
“S”: [0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,1,0, 0,0,0,0,0, 0]

Each of these arrays gets concatenated into one big ordered array that now represents which of the 26 possible letters is “on” at each of the 7 positions. Each input will therefore be a single dimensional array of length 7 x 26 = 182.

This method would give the neural network access to every possible bit of information that was available.
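The encoding described above can be sketched in a few lines of plain Python. The function names here are made up for illustration and are not taken from the project’s code; the sketch only assumes uppercase letters “A” to “Z”.

```python
import string

ALPHABET = string.ascii_uppercase  # 26 letters, "A" at index 0


def one_hot_letter(letter):
    """Return a 26-long list with a single 1 at the letter's position."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(letter)] = 1
    return vector


def one_hot_string(text):
    """Concatenate the per-letter vectors into one flat input list."""
    encoded = []
    for letter in text:
        encoded.extend(one_hot_letter(letter))
    return encoded


encoded = one_hot_string("WIYLEWS")
print(len(encoded))                      # 182
print(one_hot_letter("W").index(1) + 1)  # 23
```

Because every letter contributes exactly one 1, the encoded string always sums to 7, and no letter is numerically “bigger” than another.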

The Single Hidden Layer

With the proper inputs created, a neural network could now be developed. The first artificial neural network explored was a simple one that had a single hidden layer. This hidden layer had the same number of neurons as the number of inputs (182).

The code for the neural network with one hidden layer is here:

gitsome/HumanIntuitionChallenge

HumanIntuitionChallenge - A project to compare human intuition against neural networks with TensorFlow

github.com

Comparing the learning capabilities of the neural net with those of the humans was the goal. The best human performers were able to hit 80% accuracy in an average of 43 examples. The question, therefore, was how many training iterations the neural network would need to reach 80% accuracy.

During training, a batch size of 1 was used (adjusting weights after each input) for one epoch (number of times using the data set) to mimic the iterative training the humans received.

Here are the hyper parameters and configurations used:

Batch Size: 1
Epochs: 1
Non-linear Activation Function: None (simple linear activation)
Gradient Descent Learning Rate: 0.03
Initial Random Weight Function: Random Uniform with stddev = 0.01
Initial Random Weights for Bias: Zeros

Single hidden layer as a first attempt (the 3 inputs on the left represent the 182 inputs used in the real neural network)
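To show what training with a batch size of 1 and a learning rate of 0.03 looks like, here is a rough sketch in plain Python rather than the TensorFlow code from the repository. It makes several simplifying assumptions: it trains a linear softmax classifier directly on the 182 inputs (with no non-linear activation, a hidden layer followed by an output layer still composes into one linear map), it starts from zero weights instead of small random ones, and its data generator applies only the 5th-letter vowel/consonant rule, with “Y” treated as a consonant. All names are hypothetical.

```python
import math
import random
import string

ALPHABET = string.ascii_uppercase
VOWELS = "AEIOU"  # assumption: "Y" is treated as a consonant
CONSONANTS = "".join(ch for ch in ALPHABET if ch not in VOWELS)
STRING_LENGTH = 7
NUM_INPUTS = STRING_LENGTH * len(ALPHABET)  # 182
LEARNING_RATE = 0.03


def make_example(rng):
    """Simplified stand-in for the schemes: only the 5th-letter rule."""
    label = rng.randrange(2)  # 0 = Scheme A, 1 = Scheme B
    letters = [rng.choice(ALPHABET) for _ in range(STRING_LENGTH)]
    letters[4] = rng.choice(VOWELS if label == 0 else CONSONANTS)
    return "".join(letters), label


def active_inputs(text):
    """Indices of the ones in the 182-long one hot encoding."""
    return [pos * len(ALPHABET) + ALPHABET.index(ch) for pos, ch in enumerate(text)]


def predict(weights, bias, text):
    """Softmax over two linear outputs."""
    active = active_inputs(text)
    logits = [bias[c] + sum(weights[c][i] for i in active) for c in range(2)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights, bias, text, label):
    """One gradient descent update on a single example (batch size 1)."""
    probs = predict(weights, bias, text)
    for c in range(2):
        # Gradient of the cross-entropy loss w.r.t. logit c is (p_c - y_c).
        error = probs[c] - (1.0 if c == label else 0.0)
        bias[c] -= LEARNING_RATE * error
        for i in active_inputs(text):
            weights[c][i] -= LEARNING_RATE * error


def accuracy(weights, bias, examples):
    hits = 0
    for text, label in examples:
        probs = predict(weights, bias, text)
        hits += int(probs.index(max(probs)) == label)
    return hits / len(examples)


rng = random.Random(42)
weights = [[0.0] * NUM_INPUTS for _ in range(2)]
bias = [0.0, 0.0]
test_set = [make_example(rng) for _ in range(500)]
for _ in range(2000):  # one pass, one example per update
    text, label = make_example(rng)
    train_step(weights, bias, text, label)
print(accuracy(weights, bias, test_set))
```

Evaluating `accuracy` after 43 updates instead of 2000 gives a feel for the comparison with the human participants, though the numbers will not match the article’s because the rules and the network here are simplified.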

After the hyper parameters were optimized, the neural network achieved an average of 77% accuracy on the verification data after 43 training rounds. Below is a graph of the accuracy on 20 random data sets:

The neural network was learning, but the best humans could hit 80% on an average of 43 attempts. A more powerful neural network was required and it was time to troubleshoot.

Visualizing the importance of each input is a good first step to see what a neural network is learning. To do this, the neural network was trained on 2,500 items to get it to converge to 99.8% accuracy. Since this network had a single set of weights, the effect of each input on classification is transparent.

Below are the 182 weights associated with each Scheme organized by string position and letter value:

The weights represent many of the rule sets pretty well!

For example, in Scheme A, a random vowel is forced in the 5th position, and for Scheme B a random consonant is forced in the 5th position. You can see that all weights in the 5th position are considered significant (warmer or cooler on average).
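Organizing weights “by string position and letter value” amounts to reshaping the flat list of 182 weights into 7 rows of 26. The sketch below shows that reshaping plus a crude per-position strength score. The weights are made up for illustration (only the 5th position carries signal) and are not the trained values from the article; the function names are hypothetical.

```python
import string

ALPHABET = string.ascii_uppercase
STRING_LENGTH = 7


def weights_to_grid(flat_weights):
    """Reshape 182 flat weights into 7 rows (positions) of 26 letters."""
    width = len(ALPHABET)
    return [flat_weights[row * width:(row + 1) * width] for row in range(STRING_LENGTH)]


def position_strength(grid):
    """Mean absolute weight per position; larger means more influential."""
    return [sum(abs(w) for w in row) / len(row) for row in grid]


# Made-up weights: vowels in the 5th position push one way, consonants the other.
flat = [0.0] * (STRING_LENGTH * len(ALPHABET))
for offset, letter in enumerate(ALPHABET):
    flat[4 * len(ALPHABET) + offset] = 1.0 if letter in "AEIOU" else -1.0

grid = weights_to_grid(flat)
strength = position_strength(grid)
print(strength.index(max(strength)) + 1)  # 5
```

A grid in this shape can be passed straight to a heatmap plot, with positions on one axis and letters on the other.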

Visualizing what the machine is “thinking” helped validate that the hidden layer was exhibiting some expected behavior. The patterns that emerged from the weights looked appropriate, but it seemed strange that the neural network could not beat the best humans.

The single hidden layer network was great at identifying letters in specific positions that were related to the rule sets generating the strings, but it could have been misrepresenting the relative association between different letters and positions.

For example, take a look at these transformations applied to Scheme A:

cursor.forEach('P', function (c, i) {
    c.next().set('S');
});
cursor.forEach('F', function (c, i) {
    c.next().set('S');
});
cursor.forEach('V', function (c, i) {
    c.next().set('S');
});

These rules say: find every ‘P’, ‘F’, or ‘V’ and make the next letter ‘S’. This should translate to the neural network recognizing ‘P’, ‘F’, ‘V’, and ‘S’ as significant in classification. If you look back up at the weights you will see Scheme A considers ‘S’ important, but the corresponding letters ‘P’, ‘F’, and ‘V’ are quite “cool” for Scheme A.

The problem, however, is that if the net is shown an ‘S’ it will automatically put more weight towards Scheme A. That could often be a false positive, because the requirement wasn’t just to have an ‘S’ but to have a ‘P’, ‘F’, or ‘V’ preceding the ‘S’.

Perhaps a single layer wasn’t enough?

An Attempt at Deeper Learning

The single hidden layer did well but there were some position invariant patterns that needed to be taken into account. Adding more hidden layers with non-linear activations should help with recognition of position invariant patterns.

Scheme A and Scheme B have rule sets that create both position dependent and position invariant patterns, but before testing the original schemes, it would be helpful to test a set of schemes that have ONLY position invariant patterns. Here are two such schemes, C and D:

/*======= Scheme C ========*/
cursor.forEach('A', function (c, i) {
    c.next().set('Z');
});
cursor.forEach('B', function (c, i) {
    c.next().set('Y');
});
cursor.forEach('C', function (c, i) {
    c.next().set('X');
});
cursor.forEach('D', function (c, i) {
    c.next().set('W');
});
cursor.forEach('E', function (c, i) {
    c.next().set('V');
});
cursor.forEach('F', function (c, i) {
    c.next().set('U');
});
/*======= Scheme D ========*/
cursor.forEach('Z', function (c, i) {
    c.next().set('A');
});
cursor.forEach('Y', function (c, i) {
    c.next().set('B');
});
cursor.forEach('X', function (c, i) {
    c.next().set('C');
});
cursor.forEach('W', function (c, i) {
    c.next().set('D');
});
cursor.forEach('V', function (c, i) {
    c.next().set('E');
});
cursor.forEach('U', function (c, i) {
    c.next().set('F');
});

In general, schemes C and D each find a set of letters and then place a corresponding letter after each occurrence. This creates patterns that are not tied to one spot but can occur anywhere in the string.
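As a rough Python analogue of these “find a letter, set the next one” rules, the sketch below stores each scheme as a mapping from trigger letter to follower letter. It assumes that rules run one after another over the same string, that each rule scans left to right and sees earlier edits, and that a trigger in the last position has no “next” letter and is skipped. How the real cursor handles those cases is not stated in the article, and the function name is made up.

```python
SCHEME_C = {"A": "Z", "B": "Y", "C": "X", "D": "W", "E": "V", "F": "U"}
SCHEME_D = {"Z": "A", "Y": "B", "X": "C", "W": "D", "V": "E", "U": "F"}


def apply_scheme(text, rules):
    """Apply each trigger -> follower rule in order to a copy of the string."""
    letters = list(text)
    for trigger, follower in rules.items():
        # Assumption: a trigger in the last position has nothing to overwrite.
        for i in range(len(letters) - 1):
            if letters[i] == trigger:
                letters[i + 1] = follower
    return "".join(letters)


print(apply_scheme("QAKMBTE", SCHEME_C))  # QAZMBYE
print(apply_scheme("QAKMBTE", SCHEME_D))  # QAKMBTE
```

In the first call the pairs “AZ” and “BY” appear wherever “A” and “B” happened to fall, which is what makes the pattern position invariant. The second call leaves the string unchanged because it contains none of Scheme D’s trigger letters.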

The first network to attempt learning Scheme C and Scheme D was a neural network with a hidden layer with non-linear activation and another hidden layer with linear activation.

With 1500 items from Scheme C and Scheme D for training, this neural network hit 94% accuracy. Using Olden’s connection weight algorithm, the following input “importance” image was generated:

If you consider the patterns for Scheme C and D, you will note that the relevant inputs are being used for prediction.
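For readers unfamiliar with the connection weight approach, the core calculation for a network with one hidden layer is small: for each input and output pair, multiply the input-to-hidden weight by the hidden-to-output weight and sum over the hidden neurons. The sketch below uses a tiny made-up network, not the article’s trained weights, and the function name is hypothetical. How the calculation was extended to the second hidden layer in this experiment is not described here, so the sketch stays with the single hidden layer case.

```python
def connection_weight_importance(input_hidden, hidden_output):
    """Sum over hidden units of (input->hidden weight) * (hidden->output weight).

    input_hidden[i][h] is the weight from input i to hidden unit h.
    hidden_output[h][o] is the weight from hidden unit h to output o.
    Returns importance[i][o].
    """
    num_outputs = len(hidden_output[0])
    importance = []
    for input_row in input_hidden:
        row = []
        for o in range(num_outputs):
            row.append(sum(w_ih * hidden_output[h][o] for h, w_ih in enumerate(input_row)))
        importance.append(row)
    return importance


# Tiny made-up network: 3 inputs, 2 hidden units, 1 output.
input_hidden = [[0.5, -1.0], [0.0, 0.2], [2.0, 1.0]]
hidden_output = [[1.0], [-0.5]]
print(connection_weight_importance(input_hidden, hidden_output))
# [[1.0], [-0.1], [1.5]]
```

The sign is kept on purpose: a positive value means the input pushes the output up, a negative value means it pushes the output down.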

This neural network was able to identify important inputs, but it needed to be tested against the original schemes A and B. Below is the graph of its accuracy for 20 random data sets over 150 training items from Scheme A and Scheme B:

At trial 43, the variance is a little high but the mean is around 84% which is much better than the 77% that the single hidden layer neural net achieved. It was evident that this network was better in the early iterations, but the variance did not settle as quickly as the single hidden layer. Here are the importance measures for the inputs for this neural net after being trained on 400 items:

It looks as though this network could identify the inputs that were important, but the extra hidden layer may have been losing some of the certainty that comes from the position dependent rules.

Enter the Multi-Track Neural Network

What was needed was the best of both worlds. The ideal neural network for this experiment needed to handle both position dependent and position invariant patterns. It would be great to somehow merge the two different networks into one.

Luckily, combining multiple classifiers is a common technique that Perrone 94 discusses. One of these techniques is called competing experts and a flavor of this approach seemed to fit this situation the best.

The competing experts method uses multiple “experts” (neural networks). Each of these neural nets is fed a copy of the inputs, and each produces an output of the same size. These outputs are then combined and the entire network is trained simultaneously.

Multi track neural net, track one with two hidden layers, track two with one hidden layer

This network performed better. The top track (2 hidden layers) and the bottom track (single hidden layer) combined their outputs via addition and the final classification is made in a single softmax layer. See the graph of the accuracy of 20 random trials with this multi-track neural network below:

The network squeezed out 88% accuracy in 43 training steps, with less variance by the 43rd step and beyond.
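The forward pass of such a two-track network can be sketched in plain Python as follows. This is only an illustration of combining two tracks by addition before a softmax, not the TensorFlow implementation from the repository. The ReLU activation, the hidden size of 16, the Gaussian weight initialization, and the assumption that each track ends in its own 2-value output layer are all choices made for this sketch.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: weights[j][i] maps input i to unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    # Assumption: the article does not name the non-linear activation it used.
    return [max(0.0, v) for v in values]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_in, n_out):
    """Small random weights and zero biases for one layer."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def multi_track_forward(inputs, track_one, track_two):
    # Track one: non-linear hidden layer, linear hidden layer, then 2 outputs.
    (w1, b1), (w2, b2), (w3, b3) = track_one
    out_one = dense(dense(relu(dense(inputs, w1, b1)), w2, b2), w3, b3)
    # Track two: one linear hidden layer, then 2 outputs.
    (v1, c1), (v2, c2) = track_two
    out_two = dense(dense(inputs, v1, c1), v2, c2)
    # Combine the tracks by element-wise addition, then classify with softmax.
    combined = [a + b for a, b in zip(out_one, out_two)]
    return softmax(combined)


rng = random.Random(1)
n_in, n_hidden = 182, 16  # hidden size of 16 is arbitrary for this sketch
track_one = [
    random_layer(rng, n_in, n_hidden),
    random_layer(rng, n_hidden, n_hidden),
    random_layer(rng, n_hidden, 2),
]
track_two = [random_layer(rng, n_in, n_hidden), random_layer(rng, n_hidden, 2)]

x = [0.0] * n_in
for index in (22, 34):  # switch on a couple of inputs as a stand-in for a string
    x[index] = 1.0
probs = multi_track_forward(x, track_one, track_two)
print(probs)
```

Because the two tracks are summed before the softmax, a single loss on the final output sends gradients into both tracks, which is what lets the whole network be trained simultaneously.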

It was great to see some improvement, but whether the neural network was doing exactly what was intended has not been determined and requires further investigation. Here are the importance measurements for the two different tracks after 400 iterations of training (this is where the neural network stabilizes at around 99%):

You can see the single hidden layer in Track 2 has a similar pattern as the original single hidden layer.

Track 1 (which had the two hidden layers) does not seem to show any patterns related to the importance of the inputs. It is possible that the first track is overfitting, which pushes the network to focus heavily on the position dependent rules so that it can hit 100% accuracy. More exploration of the effects of this multi-track architecture is needed and will be discussed in a future article.

Conclusions

This experiment explored three different neural networks and their application to a unique data set. The goal was to see how quickly the accuracy could be raised while also minimizing variance in order to beat a group of human participants on the same types of data. Note that the goal of developing neural networks is to create a model that effectively consumes novel input data. The network does not necessarily have to learn quickly, though it can be useful to understand how to increase efficiency and reduce variance.

There is a lot of work that needs to be done to flesh out an optimal neural network that can handle many different rule sets for this experiment, but here are a few insights that may help you and others understand this experiment and machine learning in general.

First, here are the accuracy graphs for the three different neural networks explored in this article, trained over 700 iterations:

Here are a few interesting points from this experiment to consider:

The human participants performed incredibly well but their results had substantial variance. Likewise substantial variance occurred at the 43rd iteration for all the neural networks. It would be interesting to apply mathematical rigor to illuminate the limits of variance and accuracy for this data set.
Variance for the single hidden layer and the multi-track neural net dropped off quicker than the neural net with two hidden layers. In fact, the two hidden layer network continued to experience higher variance in accuracy even well beyond 600 iterations of training.
Hitting 100% accuracy should have been possible because the rules used to generate the strings in Scheme A and Scheme B made the sets mutually exclusive (guaranteeing a vowel or a consonant in the 5th position) but achieving 100% accuracy in a short time was difficult to do. The multi-track neural network achieved 100% accuracy and eliminated variance the best.
The application of a variation of “competing experts” was ad hoc and needs more investigation. There are likely many improvements that could be made and more analysis is needed to explore the combinations of multiple classifiers on this data set.

Lessons For Newcomers

This experiment is great for those learning about artificial neural networks and TensorFlow. Here are a few lessons that are appropriate for beginners to keep in mind. Experts, please chime in with your own guidance to help beginners (like myself) avoid common pitfalls.

Learning how to visualize what your neural network is doing is crucial. Learn how to generate images using Matplotlib and also generate data to visualize in TensorBoard.
Assumptions can get you into deep trouble fast. Dive deep and confirm what you think is happening. Find mathematical backup when possible.
Learn methods for understanding relative input importance, changes in loss/accuracy over time, and activations of individual neurons from specific inputs.
Trial and error is a valid and important method, and it should be automated when possible, for example when optimizing hyper parameters.
Expanding your toolbox of machine learning techniques can lead to new and effective implementations.

Thanks for hanging on and please respond with any suggestions that others will find useful.
