Artificial Neural Networks -1, explained differently to my son

A corner of the universe, where our intelligence evolved. Is there a general intelligence out there?

Although there might be a general intelligence, of which human intelligence is just a partial instance, we haven’t gone beyond mimicking the biology of the human brain. The Artificial Neural Network (ANN), the digital version of the brain’s neural network model, has survived and turned out to be a major tool in modern AI. Today, a neural model is to an AI application what a transistor is to a whole computer system.

Goal

The goal of this series of posts is to explain the essence of the ANN in the simplest and most intuitive way. The essence of the ANN lies in the process of training.

The objective of this post is to set the stage for studying the training technique of the ANN. The technique itself will be explored in the next post.

This post is different - it’s verbose, with little code.

This post tries to differentiate itself by focusing on the simplest possible problem while preserving the generality and the connections to the horizons of modern AI. Although we will be wrestling with a single-dimensional feature, everything carries over to multi-dimensional features. Do not be fascinated by the multi-dimensionality, but by the essence. Essence shows best in simplicity.

If parts of the post are repetitive and verbose, it’s partially because I tried to talk about the far side of the moon: little attention is usually paid to the dark side of the ANN. I wanted to take care of those deep thinkers who would raise a series of questions about the fundamentals of the ANN. Talking about the role of intuition in the ANN is one of the specialties of this post.

Intelligent agent

A widely accepted framework of AI assumes an intelligent agent, the agent’s goal, and the environment.

https://gungorbasa.com/intelligent-agents-dc5901daba7d

A common misunderstanding is to view a robot as an intelligent agent. That is wrong in many situations. An intelligent agent does not perform an action; it selects one. The subject that performs the action is an actuator. The actuator, in this regard, belongs to the environment.

A robot taking an action =

  • the software (intelligent agent) selecting the action, plus
  • the legs (actuator) performing the action.

The critical requirement is that the agent must have complete control over itself. A robot’s leg may fail to fulfill an order chosen by the program, which means the robot does not have complete control over itself. What does have control over itself is the program.

Training and prediction

We train the intelligent agent with known facts about the environment, in the hope that the agent will predict an unknown fact in the future.

  • train = let the agent know some facts
  • predict = the agent suggests a new fact conforming to a goal

or, equivalently

  • train = let the agent experience the environment
  • predict = the agent chooses an action for a given state, to achieve a goal

We are going to start with a hen-egg problem:

  • (Training) A hen laid 5 eggs in January. She laid 25 in March.
  • (Prediction) How many did she lay in February?

Features space, label space, and fact space

The intelligent agent now knows two facts:

  • The hen laid 5 eggs in Jan.
  • The hen laid 25 eggs in Mar.

A fact has two components: a set of features and a label.

  • the set of features [January] is labeled with a label [5 eggs],
  • the set of features [March] is labeled with a label [25 eggs].

Have you noticed that we are gradually removing the domain-specific knowledge from the facts? We removed hen, laid, and in. Why not go further to remove month and eggs? So, the facts are refined to the following:

  • (1, 5)
  • (3, 25)

A 2-dimensional real space R x R is the minimum space that accommodates the hen-egg facts. The first R of R x R is for features and the other R is for labels. A fact space, therefore, is the product of the features space and the label space.

A fact is a pair: a set of features and its label

So, the hen-egg problem can be defined as follows, which is a statistical definition:

  • There are known facts: (1, 5) and (3, 25)
  • What is x if (2, x) is a fact?

No domain knowledge at all?

The statistical definition of the problem implies that we ignore the domain knowledge (hen, lay, eggs, months, etc.), even though it would be a great, even critical, help in solving the problem. Focusing more on the data than on its semantics is the signature of modern AI. This is called a statistical method.

Although we will eventually have to combine the domain technology (hen biology, here) with the pure AI technology to get the maximum achievement, let’s go statistical for the purpose of this post. Leave the biological study to someone else.

Prediction is unique? Belief in intuition

“What is x if (2, x) is a fact?”

Does such an x exist at all? Would it be unique?

Which of the following is correct?

  • The hen laid 17 eggs in Feb
  • The hen laid 6 eggs in Feb
  • The hen laid no eggs in Feb

As soon as we get rid of the domain knowledge, 15 eggs seems the most intuitive answer. We know that our intuition has flaws, even though it reflects nature. Intuition, for example, seems to prefer monotonic curves, like “the more …, the more …”, while nature is dominated by fluctuating curves, like a sine function. Intuitive means simple and natural.

Once we believe in our intuition, the solution to the problem is unique. It’s 15! And we try 15 in reality. It works. Done!

If it happens not to work, then we train the agent with more facts. The underlying foundation is our belief in intuition.
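
For the record, here is that intuitive answer spelled out in code. A minimal sketch, plain Python with nothing beyond the standard library, that draws a straight line through the two known facts and reads off February:

facts = [(1, 5), (3, 25)]  # (month, eggs)
(x1, y1), (x2, y2) = facts
x = 2  # February
y = y1 + (y2 - y1) * (x - x1) / (x2 - x1)  # straight-line interpolation
print(y)  # 15.0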

So, the prediction is unique when it’s anchored to human intuition. It’s an intuitive prediction. The flaws of intuition will be compensated by the sheer size of the data and the availability of computation matching that size. Computers can layer millions upon millions of intuitive, simple, natural pieces of thinking atop one another to form a great intelligent machine that can dig a deep exploration hole into the data and into nature. This seems to be the contemporary point of view in the AI community.

Our intuition is a great gift from nature. So great that it sometimes crashes deep into nature: the relativity theory of Einstein, the word embedding models, …

  • Human intuition * Big Data * Scalable Computation = modern AI
  • Simplicity is the ultimate sophistication — Leonardo Da Vinci?

We will be talking about simplicity and intuition again in the next post.

Analytical vs. Pragmatic

There are two fundamental ways to create a model of intelligence:

  • Analytical / symbolic
  • Pragmatic / statistical

An analytical solution would go like this:

  • y = 5 when x = 1,
  • y = 25 when x = 3.
  • So, y = 10 * x - 5, in general.
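
If you prefer the algebra done mechanically, here is a hedged sketch (assuming numpy is available) that solves w * x + b = y for the two facts as a 2 x 2 linear system:

import numpy as np

A = np.array([[1, 1],   # w * 1 + b = 5
              [3, 1]])  # w * 3 + b = 25
y = np.array([5, 25])
w, b = np.linalg.solve(A, y)
print(w, b)  # 10.0 -5.0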

Sounds perfect, doesn’t it?

Domain-specific analytical equations. Great but with less value in AI.

But how many relations in the world can be expressed in such an elegant mathematical equation? And among the relations that do have analytical equations, how many will reveal them symbolically, in terms of known elementary functions, so that we can work with them? I don’t want to degrade math here. I am actually a mathematician myself, and I see the value of math in other respects. We can’t rely on analytical solutions, even statistical ones.

Pragmatic solutions

History seems to prefer non-analytical, non-linear, domain-neutral, computation-intensive, data-intensive, statistical, and experimental methods in Artificial Intelligence. These properties compensate for and support each other. The ANN seems the most eligible technology, possessing all of them.

Everything that works in practice is a pragmatic solution, no matter how humble and messy it looks. Students fascinated by rigorous math may find it messy that an ANN has so many redundant variables (weights). They are wrong, because it is this elaborate, sophisticated redundancy that makes ANNs so successful. The redundancy “melts” into the odd corners of nonlinear curves. ANNs are right, more because they really work than because their redundancy is sophisticated.

Do not pay much attention to neural biology

In this regard, do not pay much attention to neural biology, unless you intend to invent an entirely new paradigm of ANN that surpasses the great achievements of today’s AI. Be pragmatic.

The simplest models, from scratch

The simplest neuron model for our hen-egg problem comes below:

A general neural model

We arrived at a linear model: out = 10 * in - 5, or y = w * x + b.

This is a linear regression model. Although we know how popular and powerful linear regression is, we want a neural model to subsume the linear regression model as a special case. The only way is to put a (nonlinear) function on top of the linear regression model. We then get a general model:

y = f(w * x + b)

A linear activation function.

The function f is called the activation function. The enormous role that an activation function plays will be explained in the next post.

If f(x) = x, then y = w * x + b, which is a linear model. So a neural model generalizes a linear model. A typical ANN model has hundreds of neural models inside, woven together in a sophisticated way. Note that linear regression models concatenated atop each other add nothing, as they collapse into a single linear model. But neurons layered atop each other generate a completely new space of models:

[y = f2(w2 * f1(w1 * x + b1) + b2)] vs. [y = w * x + b]
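
To see the contrast concretely, here is a minimal sketch in plain Python. Two linear layers stacked without an activation are still a single line; put even a simple nonlinearity between them (ReLU here, purely as an illustrative choice) and a new family of models appears:

def linear(w, b):
    return lambda x: w * x + b

f1 = linear(2, 1)                   # first layer: 2x + 1
f2 = linear(3, -4)                  # second layer: 3x - 4
stacked = lambda x: f2(f1(x))       # collapses to one line: 6x - 1
print(stacked(0), stacked(1))       # -1 5

relu = lambda x: max(0.0, x)        # a nonlinear activation
neural = lambda x: f2(relu(f1(x)))  # no longer a single straight line
print(neural(-1), neural(1))        # -4.0 5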

Activation functions are the heart of an ANN model. Make them linear, and the model would be nothing but a pile of ruthless straight lines. The wizard that decides the success or failure of an ANN model, that makes the model better or worse than a linear regression model, is the activation function. We will have a chance to talk about it later.

For simplicity’s sake, we assume the activation function f(x) = x for the moment. That means we have a linear, or no, activation function.

Be pragmatic to creep to a model like an insect

How did you find the third model [out = 10 * in - 5]? Be honest and say, “I did it by algebraic analysis.”

For the few problems that are as simple as ours, we can find an explicit, analytical model in one shot. But the dominating majority of real-world problems have no analytical model, or are reluctant to reveal it. We have to stay pragmatic enough to creep toward a model, however slowly, instead of shooting a star down from the sky in one stroke. This is what modern AI does. How, in particular?

We need to define the model and the model space before ‘creeping’ in that space.

What is model, again?

Slowly we go mathematical, but not analytical. In a narrow sense, a model is the mapping from features to a label that captures the minimum required information about reality.

  • model : features space → label space
  • model(features) = the label on the features

As told earlier, the fact space for our problem is R x R. We have two facts, fact1 and fact2, which are nothing but the smallest pieces of the model. fact1, for instance, means model(1) = 5.

A model can be depicted on the fact space, as a mapping from the features space to the label space.
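
To make the mapping tangible, here is a minimal sketch of a model as a plain Python function, with the two facts as checks it must pass:

def model(features):            # features space -> label space
    return 10 * features - 5    # our candidate mapping

assert model(1) == 5            # fact1
assert model(3) == 25           # fact2
print(model(2))                 # 15, the would-be prediction for February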

Models, objective and subjective

There are two models in every problem. One is the objective model and the other is the subjective model. We believe that the objective model exists even if it is unknown. It must precisely fit the known facts.

The objective model must precisely fit the known facts. We know nothing but that.

We want to find and use a model that is as close to the objective model as possible, which is called the subjective model. A subjective model is the up-to-date model that we use in place of the objective model.

The overall practice of (supervised) learning is to drive the subjective model towards the objective model. We creep to the unknown objective model by driving the subjective model.

Creeping to OM zigzag

What is the objective model in our hen-egg problem?

We don’t know exactly.

But if we trust our intuition, and if the two facts are everything we know about the objective model, then the best subjective (not objective) model should be the following:

The objective model should be somewhere near the best subjective model.

How to find where the objective model is?

Do we have to find it?

No. And there is no way. All we have to do is drive the subjective model (SM) toward the objective model (OM). It’s enough to find a way to push SM toward OM.

The objective model is known directly but only partially, through the known facts. In our hen-egg problem, OM(1) = 5 and OM(3) = 25.

Just start driving our SM towards OM!

We are going to drive our SM towards OM, but what is the initial SM? The AI practice is to take a random model as the initial SM.

Question 03: Why does AI tend to rely on randomness? Interesting.

Suppose our random, initial SM is [y = -10 x + 25], as shown below:
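
As a minimal sketch (the random range is an arbitrary choice for illustration), here is that starting point in code, together with how badly it misses the known facts:

import random

facts = [(1, 5), (3, 25)]
# w, b = random.uniform(-25, 25), random.uniform(-25, 25)  # a typical random start
w, b = -10, 25  # the example initial SM above

def sm(x):
    return w * x + b

for x, y in facts:
    print(x, y, sm(x), (sm(x) - y) ** 2)  # feature, label, SM's guess, squared error
# 1 5 15 100   and   3 25 -5 900 : a long way from OM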

It’s time to introduce the model space

As we are going to drive our SM to OM, we have to think about the strategy: the starting point, the goal point, and the road map between them. Every point on the road is a model, so we need a base map showing all the possible models. That base map is the model space.

Different problems have different types of model space. If we are convinced that a linear model suits the hen-egg problem, a model has the form y = w * x + b, is represented by the pair (w, b), and the model space is R x R.

Be careful: the model space might not contain OM. We can’t assume that OM has the form y = w * x + b, as SM does. OM is generally more complex and is unlikely to have an analytical expression at all. The model space is for SM only.

  • Where is OM, the goal? It’s partially known through the known facts.
  • How do we get near OM? We can find out, partially, because where OM is is itself partially known.

Before we commit ourselves to finding out how to get near OM, we have to make it clear again why we are doing this. Let’s be aware of the goal before embarking.

Why we do this

Once we have done this and have driven SM to the ultimate point near OM, we are going to use SM in place of OM and find SM(2): how many eggs did the hen lay in February? The prediction, the invention, of a new fact.

The overall procedure is just generalizing from the known facts to find new facts, new truths. It can’t be more natural. The only barrier we have to cross is that we have to believe in intuition.

Creep like an insect. Zigzag like a most unintelligent creature, towards an intelligence. How to creep? Let me leave it to the next post.

Summing up

The goal of this post was to set the stage for the study of the training technique of the ANN. So far, we have built an abstract yet intuitive image of a neural model through a most simple problem. While keeping the problem as simple as possible, we explored connections to the horizons of practical AI. Possible objections to the ANN from pure mathematicians were also addressed.

Now we are ready to proceed to the essence of the ANN: the training techniques. Let me leave that to the next post. I will not be as verbose as I was in this post, because many of the odd things about AI that come from its pragmatic nature have been explained here.

Additional pep talks

Don’t be horrified that training relies on a belief in intuition. AI has established methods to validate and test the result of learning. And note that AI does not yet seem popular in critical applications, like air traffic control and nuclear reactor control. There are, however, encouraging achievements.

One day, I wrote a business proposal in English and google-translated it into Chinese. For validation, I google-translated the Chinese version back into English to generate a new English version. Comparing the old and new English versions, I found the new version was even better than the old one! (I am not a native English speaker.) The translator didn’t lose a bit of context in a round trip of translation; it even enriched the document!

When I put “the chairman” in each of two successive sentences, doubting the translator’s ability to track context, the second occurrence was replaced with 他 (he/him in Chinese)! I felt horrified. There seemed to be a man behind the screen monitoring me and my context! Have a look into modern Natural Language Processing, and Machine Translation in particular. Behind the Google translator, there are digital neurons and simple intuition, layered and parallelized hundreds of times.

One of the great achievements of the ANN is word embedding, by which words are decomposed into a number of principal components. The principal components are not imposed from outside; they are perceived by the ANN model, and so is the decomposition itself. It would be interesting to study the meaning of each component. It’s exciting to think that any of a million words is a combination of some 300 fundamental concepts.

Word embedding is a digital model of our language usage, which reflects the culture most profoundly, leading to wonderful applications. Look how such a model works:

model.most_similar(positive=['woman', 'king'], negative=['man'])
# 'queen': 0.7
# (The word most [similar to 'woman' & 'king' and dissimilar to 'man'] is 'queen'. The score is 0.7.)

model.doesnt_match("Apple Microsoft IBM".split())
# 'IBM'
# ('IBM' is the least integral to the word group. Apple and Microsoft are related to each other more closely than they are to IBM; the two are direct competitors.)

model.doesnt_match("bank account river".split())
# 'river'
# (A bank account shows up more often in our lives than the bank of a river.)

model.similar_by_word("cat")
# 'dog': 0.8
# (The word 'dog' is most similar to 'cat'. In terms of what?)

# Joking below:
model.who_are_you(?)
# I am nothing but millions of simple neural models interwoven with each other.

model.what_shall_I_do_to_master_ANN(?)
# Understand a simplest neural model exhaustively, and then study how to interweave many of them to reach a goal.
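
If you would like to reproduce these calls (the jokes aside), here is a hedged sketch, assuming the gensim library and its downloadable 'word2vec-google-news-300' vectors (a 300-dimensional word2vec model; a large download on first use):

import gensim.downloader as api

# Load pretrained Google News vectors via gensim's downloader.
model = api.load("word2vec-google-news-300")

print(model.most_similar(positive=['woman', 'king'], negative=['man'])[:3])
print(model.doesnt_match("Apple Microsoft IBM".split()))
print(model.similar_by_word("cat")[:3])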

We will be diving into the art of training in the next post.

This post is followed by another post: Artificial Neural Networks -2, explained differently to my son.
