The delusions of neural networks

Let’s start with a little experiment that will give you a deeper understanding of what a neural network is. Even if you are an expert in AI, please try.

Write down on a piece of paper what you see here (up to two words):

We will be back to this experiment later, but it is very important that you keep that physical piece of paper near you.

A bit of context

The Nexa Center for Internet & Society is an independent research center, focusing on interdisciplinary analysis of the force of the Internet and of its impact on society.

The thread was about a report from the French Commission Nationale de l'Informatique et des Libertés (CNIL), and I tried to give my technical contribution by sharing obvious considerations about machine learning in general and neural networks in particular.
Such simple considerations come from hands-on experience of the matter, since I’m a programmer.

Turns out, they were not that obvious.

So here I try, in the style of Nexa, to provide useful insights to both the layman who uses AI for fun or profit, and the AI practitioner who is too deep in the field to see it from a broader, interdisciplinary perspective.
It’s lengthy, to be both clear and technically correct, but it should be worth a read.

Also, while we focus on artificial neural networks here, most of what is said applies, mutatis mutandis, to other machine learning techniques.

What is an artificial neural network?

Currently, they constitute the most exciting research field in Statistics.

“Function”

A function, in mathematics, is a rule that assigns to each element in a set exactly one element in another.

Counting is a function from the set of “things” to the set of numbers.

Addition is a function from (the set of) pairs of numbers to (the set of) numbers. So is multiplication. And exponentiation.

The point here is that, if you have two sets and a rule that maps each element of one set to exactly one element of the other, you have a function.
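For programmers, the definition can be turned into a small check. This is only a sketch: the helper name is_function and the tiny example sets are my own, chosen to keep it short.

```python
def is_function(pairs, domain):
    """True if `pairs` assigns exactly one output to every element of `domain`."""
    outputs = {}
    for x, y in pairs:
        # A second, different output for the same input breaks the rule.
        if x in outputs and outputs[x] != y:
            return False
        outputs[x] = y
    # Every element of the domain must be mapped, and nothing else.
    return set(outputs) == set(domain)

# Addition, restricted to a small set of pairs so that we can enumerate it.
pair_domain = [(a, b) for a in range(3) for b in range(3)]
addition = [((a, b), a + b) for a, b in pair_domain]
print(is_function(addition, pair_domain))   # True

# "The numbers whose square is x" gives two outputs per input: not a function.
square_roots = [(4, 2), (4, -2), (9, 3), (9, -3)]
print(is_function(square_roots, [4, 9]))    # False
```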

“Statistically”

This should explain why they need tons of data to be calibrated.
They are just like any other statistical application (or ML technique).

Don’t agree? Try to apply them to domains with low cardinality. ;-)

If you think of neural networks (and ML in general) as statistical applications, it suddenly becomes clear that any discriminatory behavior is not due to an obscure “bias” in an inscrutable computer brain.

It’s simply a problem in the data. Or in the code. Or both.

“Approximate”

Consider the function f : ℕ → ℙ that maps each natural N to the Nth prime.

We have f(0) = 2; f(1) = 3; f(2) = 5; f(3) = 7 … and so on.
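The exact rule is easy to write down. Here is a minimal sketch that uses plain trial division (the names primes and nth_prime are mine) and can produce as many (N, Nth prime) samples as you like:

```python
from itertools import count, islice

def primes():
    """Yield the primes in increasing order, using trial division."""
    found = []
    for candidate in count(2):
        # Only primes up to the square root of the candidate need checking.
        if all(candidate % p for p in found if p * p <= candidate):
            found.append(candidate)
            yield candidate

def nth_prime(n):
    """f(n): the n-th prime, counting from zero as in the text."""
    return next(islice(primes(), n, None))

samples = [(n, nth_prime(n)) for n in range(8)]
print(samples)
# [(0, 2), (1, 3), (2, 5), (3, 7), (4, 11), (5, 13), (6, 17), (7, 19)]
```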

You literally have infinite samples to use, so you might want to use a neural network to approximate this function.

Or, you could try to calibrate a neural network to identify prime numbers.
Again you have a function and literally infinite samples to use.

But, if you try to use their outputs for cryptography, you are doomed.
Still you can’t blame the AI, just your poor understanding of statistics.

“Algorithms”

In 1972, however, Harold Stone provided a useful informal definition that most programmers will agree with: “we define an algorithm to be a set of rules that precisely defines a sequence of operations such that each rule is effective and definite and such that the sequence terminates in a finite time.”

Since computer systems are deterministic in nature (and will continue to be, until the widespread adoption of quantum computing), all algorithms executed by computers are deterministic too.

When a race condition makes a concurrent algorithm non-deterministic, programmers call it “a bug”. We just add time to the equation and fix it.
And when true entropy is fed to a deterministic algorithm to make its results hard to predict, we can still replicate them by recording the random bits fed to the algorithm and replaying its execution with those same bits.
We randomize the input, not the algorithm.
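A minimal sketch of that idea, assuming the operating system's entropy source (os.urandom) as the supplier of random bits. The function shuffle_with_bits is a hypothetical example: the shuffle itself is fully deterministic, and only its input is random.

```python
import os

def shuffle_with_bits(items, random_bytes):
    """Fisher-Yates shuffle driven only by the bytes it is given."""
    result = list(items)
    stream = iter(random_bytes)
    for i in range(len(result) - 1, 0, -1):
        j = next(stream) % (i + 1)  # modulo bias is ignored in this sketch
        result[i], result[j] = result[j], result[i]
    return result

deck = list(range(10))
recorded = os.urandom(len(deck))               # record the random bits
first_run = shuffle_with_bits(deck, recorded)
replayed = shuffle_with_bits(deck, recorded)   # same bits, same execution
print(first_run == replayed)                   # True
```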

“Any”

Neural networks can statistically approximate any function.

Even unknown ones.

If you suspect that a function exists, you can try to statistically approximate it with a neural network, even if you do not know the rule that it follows.
You just need two sets. And tons of data.
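To make this concrete, here is a toy calibration written with nothing but the standard library. It is a sketch under my own assumptions, not a recipe: one hidden layer of 8 tanh units, plain gradient descent, and a stand-in hidden_rule that the calibration loop never looks at (it only sees the samples).

```python
import math
import random

def hidden_rule(x):
    # Stand-in for the "unknown" function. The calibration loop never reads
    # this rule: it only sees the sampled (input, output) pairs.
    return x * x

samples = [(i / 10, hidden_rule(i / 10)) for i in range(-10, 11)]

rng = random.Random(0)  # fixed seed, so the initial values can be replayed
HIDDEN = 8
w1 = [rng.uniform(-1, 1) for _ in range(HIDDEN)]  # input -> hidden weights
b1 = [rng.uniform(-1, 1) for _ in range(HIDDEN)]  # hidden biases
w2 = [rng.uniform(-1, 1) for _ in range(HIDDEN)]  # hidden -> output weights
b2 = 0.0                                          # output bias

def predict(x):
    return b2 + sum(w2[k] * math.tanh(w1[k] * x + b1[k]) for k in range(HIDDEN))

def mean_squared_error():
    return sum((predict(x) - y) ** 2 for x, y in samples) / len(samples)

error_before = mean_squared_error()

RATE = 0.02
for _ in range(3000):  # plain full-batch gradient descent
    g_w1 = [0.0] * HIDDEN
    g_b1 = [0.0] * HIDDEN
    g_w2 = [0.0] * HIDDEN
    g_b2 = 0.0
    for x, y in samples:
        hidden = [math.tanh(w1[k] * x + b1[k]) for k in range(HIDDEN)]
        out = b2 + sum(w2[k] * hidden[k] for k in range(HIDDEN))
        delta = 2 * (out - y) / len(samples)  # d(error)/d(out) for this sample
        g_b2 += delta
        for k in range(HIDDEN):
            g_w2[k] += delta * hidden[k]
            back = delta * w2[k] * (1 - hidden[k] ** 2)  # back through tanh
            g_w1[k] += back * x
            g_b1[k] += back
    for k in range(HIDDEN):
        w1[k] -= RATE * g_w1[k]
        b1[k] -= RATE * g_b1[k]
        w2[k] -= RATE * g_w2[k]
    b2 -= RATE * g_b2

error_after = mean_squared_error()
print(error_before, error_after)  # the error on the samples shrinks
```

Note what the program can and cannot tell you: the error is measured on the samples only, so it says nothing about what predict returns far away from them.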

This is the greatest strength of neural networks. And their weakness, too.

So where is the intelligence?

Can you see the cat? Me too.

This is what we call pattern recognition: we match a stimulus with information retrieved from memory.

We, as humans, are very good at this. Very, very good.

Still, there is no cat there!

Really, there is just a screen connected to a computer. ;-)

Humans are so good at pattern recognition that we can be fooled by it.

Beauty is in the eye of the beholder

We “suspect” that a function exists between these two sets. :-D

We look for a large data set, classified by an expert, such as a Botanist.

We calibrate a neural network to approximate that hypothetical function.

Finally we run the program, and we see that it classifies Iris “like a pro”.

And, just like a mother looks at her beloved son, we say: “how smart it is!”

We see a program doing the work of a Botanist and we recognize a pattern.
We are matching the program with experiences from our own memories.

We look at the computer and we see a Botanist. We see an intelligence.

But it’s like with the cat.

Verba manent.

The words we use to describe the reality forge our understanding of it.

Talking about “deep learning”, “intelligence” and “training” is evocative, attracts investments and turns programmers into demigods.

It’s funny, but plain wrong.
And dangerous, as we will see later.

Enter, Big Data

We don’t have better algorithms. We just have more data.
— Peter Norvig, Chief Scientist, Google

It’s obvious, once you understand that they are just statistical applications.

Still, the amount of data required to calibrate a neural network is so large that, despite being a 70-year-old technology, it became practical only recently.

Today, anybody can easily collect, buy or sell tons of data.

Why? Simply because we leak data. Precious data. Data about ourselves.

The perfect match

With enough data, we could try to calibrate a neural network to select the resumes to consider for an interview. Or to decide the perfect salary for an employee. Or to select the best match for a transplant. Or for a love story.
A company could try to approximate the “right” cost for your insurance.

The “function” in the details

Let’s recap: a neural network can statistically approximate any function.

How many curves pass through a point in a multidimensional space?
And through two points? And through three? And through N?

Turns out that the answer to all of these questions is “infinitely many”.

Can you see the problem?

After training a neural network, we do not know which knowledge it will deduce from the training samples and we do not know what reasoning it will use for its computation. It’s like a black box.

It approximates the desired output in the range covered by our samples.
That’s all we can say.
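The "infinitely many curves" argument can be checked with integers alone. In this sketch (the four sample points and the helper make_alternative are made up for illustration) we add to a candidate a term that vanishes on every sample, so each new candidate agrees with the data and disagrees everywhere else:

```python
points = [(0, 0), (1, 1), (2, 4), (3, 9)]  # four samples of some target

def candidate_a(x):
    return x * x

def make_alternative(c):
    """Another curve through the same points, for any constant c."""
    def candidate(x):
        bump = c
        for sample_x, _ in points:
            bump *= (x - sample_x)  # zero whenever x is one of the samples
        return x * x + bump
    return candidate

candidate_b = make_alternative(1)
candidate_c = make_alternative(-5)

for f in (candidate_a, candidate_b, candidate_c):
    # All three are indistinguishable on the samples...
    assert all(f(x) == y for x, y in points)

# ...and wildly different just outside them.
print(candidate_a(4), candidate_b(4), candidate_c(4))  # 16 40 -104
```

Every value of c gives one more function that fits the data perfectly, and the samples alone cannot tell them apart.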

More properly

Now, there is some interesting research about this issue, but I’m not very optimistic about it. My insight is that being able to deduce the target function from a generic calibrated neural network is equivalent to solving the halting problem.
After all, DGNs are neural networks too!

BTW, we need a big data set to filter out unwanted functions.

All the headaches you get from overfitting or underfitting are just side effects of this heroic challenge: whenever you feed a sample to the network, you exclude an infinite number of functions from being approximated by it.
Nevertheless an infinite number of functions still fit your data set.
So you cannot know which function your network will approximate.
It’s sad, but you can’t really win.

Headache apart, this fact has deep legal implications.

How can you prove that a neural network does not discriminate against a minority?
How can you prove that it’s not calibrated to be racist? Or sexist?

Theoretically, you cannot.

Brute force to the rescue!

When in doubt, use brute force.
— Ken Thompson, Unix inventor

To show that your neural network is not “trained to discriminate” you simply have to declare the function you tried to approximate and

• provide (and thus safely store, for years) the whole data sets used to calibrate the network, including the one used for cross-validation;
• provide (and thus safely store, for years) the initial values you chose for each topology you tried, and obviously the topologies;
• disclose the full source code, with documentation;
• hire an independent team of experts to verify the whole application.
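In practice, the first three items could be pinned down by a small manifest stored next to the network. What follows is a sketch under my own assumptions: the field names, the build_manifest helper and the sample bytes are all hypothetical, and SHA-256 digests are just one reasonable way to show that the stored artifacts were not altered later.

```python
import hashlib
import json

def fingerprint(data_bytes):
    """SHA-256 digest of a stored artifact (data set, source archive...)."""
    return hashlib.sha256(data_bytes).hexdigest()

def build_manifest(declared_function, datasets, seed, topology, code_bytes):
    return {
        "declared_function": declared_function,
        "datasets": {name: fingerprint(blob)
                     for name, blob in sorted(datasets.items())},
        "initial_values_seed": seed,   # lets the initial values be regenerated
        "topology": topology,          # units per layer
        "source_code": fingerprint(code_bytes),
    }

# Hypothetical artifacts, kept tiny for the sake of the example.
manifest = build_manifest(
    declared_function="resume features -> invite to interview (yes/no)",
    datasets={
        "calibration": b"years,degree,invited\n5,1,1\n",
        "cross_validation": b"years,degree,invited\n2,0,0\n",
    },
    seed=1234,
    topology=[16, 8, 1],
    code_bytes=b"def calibrate(): pass\n",
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

One manifest is needed for every topology you tried, not just for the one you shipped.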

The experts will try to falsify a theory, a predicate about the network.
Experimentally. They will at least

• verify the data sets against selection bias;
• verify that the training outputs do not actually apply a discrimination;
• carefully debug the neural networks during the whole calibration, to prove that your network actually derives from that calibration;
• verify that your neural network was the best performing in the cross-validation (that also means to debug all the neural networks you included in the process);
• verify that no programming error affects the network behavior.
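As an example of the second check, one simple thing the experts could compute is the rate of favourable training outputs per group. This is only a sketch with made-up records and a hypothetical helper, selection_rates. A low ratio between groups is a hint that deserves investigation, not a proof of discrimination, and a balanced ratio does not prove its absence.

```python
from collections import defaultdict

def selection_rates(records):
    """records: iterable of (group, label), with label 1 = favourable output."""
    totals = defaultdict(int)
    favourable = defaultdict(int)
    for group, label in records:
        totals[group] += 1
        favourable[group] += label
    return {group: favourable[group] / totals[group] for group in totals}

# Made-up training outputs for two groups of four people each.
training_outputs = [("A", 1), ("A", 1), ("A", 0), ("A", 1),
                    ("B", 1), ("B", 0), ("B", 0), ("B", 0)]

rates = selection_rates(training_outputs)
# Assumes at least one group has a non-zero rate.
ratio = min(rates.values()) / max(rates.values())
print(rates)   # {'A': 0.75, 'B': 0.25}
print(ratio)   # group B gets a favourable output a third as often as group A
```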

Unless they find a programming error first (and, trust me, they almost surely will), this might take years, since the cost of debugging always grows with complexity.

Still, while very expensive, this approach is always technically possible.
Remember? Neural networks are deterministic programs.

But when is it worth such effort?
When do you have to pay such a huge cost?

A legal perspective (in European Union)

But if you delegate a decision about people to a neural network, the decision is still your own responsibility.
You are stating that you trust whoever implemented, configured and calibrated the network, and you are accepting to be liable for its outputs.

It’s just a statistical application, after all!

European General Data Protection Regulation

This was pretty clear to people who wrote the European GDPR.

Indeed, on Nexa’s mailing list, Marco Ciurcina pointed out that article 13 and article 14 of the GDPR were relevant to the discourse.

In particular, point (f) of Article 13(2) states that

…the controller shall (…) provide the data subject with the following further information necessary to ensure fair and transparent processing: (…) the existence of automated decision-making, including profiling, referred to in Article 22(1) and (4) and, at least in those cases, meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject.

Point (g) of Article 14(2) is equivalent, but relates to data acquired from third parties, instead of from the data subject.

So if you operate in Europe and you apply AI to people, you should be able to explain the logic that led to each of its outputs in a court.

“Meaningful information”

To be “meaningful” the information about a decision-making process (automated or not) must be

• pertinent (it must describe that specific process)
• understandable (to the data subject and the court)

But to prove that it is pertinent, the information must be complete.

And to prove that it is complete, you have to be able to replicate that specific decision-making process using it.

So, to recap, the meaningful information about a decision-making process is all the information that is relevant to the process itself and that the data subject (or the Magistrate) can use to replicate the process itself.

And obviously, to be pertinent, such information must be up to date.

Programmers can see how Magistrates debug. :-)

Just a matter of costs

But you can’t simply state “we didn’t train the network to be racist”.
Or “the neural network was simply trained so and so”.

You could be lying.

Fortunately, you can prove your statement. With brute force and debug.
It’s just another cost in the budget. Probably a huge cost, but a cost.

Still, if you can’t afford to show that your neural networks (or your ML applications in general) are approximating legal functions, it’s wise to replace them.

Pick out

Artificial neural networks are simply deterministic algorithms that statistically approximate functions. It’s just not possible to exactly say which function they approximate.

This is not a big issue, until you apply them to people.
And it’s just expensive when you do. At least in Europe.

Incidentally, the biggest data sets are about people.

And artificial intelligence is not allowed to discriminate on your behalf.