7 data science principles introduced in Asimov’s Foundation

Balabit Unsupervised Blog
7 min read · Dec 16, 2015

Authors’ imaginations have often foreshadowed real inventions. Think of the technical achievements first introduced in Jules Verne’s novels, such as the submarine and the helicopter, of the cellphone-like gadgets called communicators in Star Trek, or of the six degrees of separation concept described in Frigyes Karinthy’s short story in 1929, a couple of years before Moreno started to study social networks.

The same happened with data mining. The first book to describe its basic concept was Asimov’s Foundation series. The first part was published in 1951, decades before any data mining analysis was actually carried out and before computers made such analysis possible at all. The Foundation series became one of the most famous works of science fiction; it also won the one-time Hugo Award for “Best All-Time Series” in 1966.

This post focuses on the data mining aspects of the Foundation and tries to give as few spoilers about the story as possible. The series is excellent, and not only because of the hidden data mining principles: the story itself is exciting and eventful, as its many awards confirm. Christmas is approaching; if you haven’t read the Foundation series (enough times), ask Santa…

The situation in a nutshell: the mathematician Hari Seldon created a profound new statistical science called psychohistory. He uses statistics to predict the decay of the Galactic Empire and to influence the long-term future history of mankind.

Let’s go through the basic rules of predictive model building one by one and see how amazingly Asimov anticipated them.

1. A huge amount of data is needed to produce reliable results.

The definition of psychohistory in the novel:

Gaal Dornick, using nonmathematical concepts, has defined psychohistory to be that branch of mathematics which deals with the reactions of human conglomerates to fixed social and economic stimuli… Implicit in all these definitions is the assumption that the human conglomerate being dealt with is sufficiently large for valid statistical treatment. The necessary size of such a conglomerate may be determined by Seldon’s First Theorem which …

Psychohistory predicts the behavior of crowds. It says explicitly that the analysis is valid only for masses of people. In data mining the same is true: to predict a company’s churners or credit loan defaults accurately and reliably, we need a large amount of past data. The bigger, the better. Maybe this short quotation from the Foundation is also the first mention of the recently hyped big data phenomenon. 😉
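As a minimal sketch of why size matters (the 20% “true” churn rate and the sample sizes below are invented for illustration), the snippet estimates a churn rate from samples of increasing size; the standard error of the estimate shrinks roughly with the square root of the sample size, so bigger really is better.

```python
import random

random.seed(42)
TRUE_CHURN_RATE = 0.2  # hypothetical "true" rate, chosen only for this example

for n in (100, 10_000, 1_000_000):
    # Simulate n customers, each churning with the assumed probability
    churned = sum(random.random() < TRUE_CHURN_RATE for _ in range(n))
    estimate = churned / n
    # Standard error of an estimated proportion: sqrt(p * (1 - p) / n)
    std_err = (estimate * (1 - estimate) / n) ** 0.5
    print(f"n = {n:>9,}   estimated rate = {estimate:.4f}   std. error = {std_err:.4f}")
```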

2. The amount of data implies that the analysis requires computers; manual computation is impractical.

Seldon removed his calculator pad from the pouch at his belt. Men said he kept one beneath his pillow for use in moments of wakefulness. Its gray, glossy finish was slightly worn by use. Seldon’s nimble fingers, spotted now with age, played along the files and rows of buttons that filled its surface. Red symbols glowed out from the upper tier.

Today it is pretty obvious that we use computers for almost everything, but circumstances were quite different when the book was published. The prototype of the first commercial computer, the UNIVAC, was built between 1943 and 1946, and because of a patent rights dispute the first machine was delivered only on 31 March 1951. These first commercial computers were so large that they filled entire rooms, and their processing speed was 0.525 ms for arithmetic functions, 2.15 ms for multiplication, and 3.9 ms for division.

So imagining, at that time, a personal tablet that handles data on quintillions of human beings was a brilliant piece of divination.

3. Simple predictive models can be refined by adding more fields to the analysis.

He said, “That represents the condition of the Empire at present.” He waited. Gaal said finally, “Surely that is not a complete representation.” “No, not complete,” said Seldon. “I am glad you do not accept my word blindly. However, this is an approximation which will serve to demonstrate the proposition. Will you accept that?” “Subject to my later verification of the derivation of the function, yes.” Gaal was carefully avoiding a possible trap. “Good. Add to this the known probability of Imperial assassination, viceregal revolt, the contemporary recurrence of periods of economic depression, the declining rate of planetary explorations, the. . .”

He proceeded. As each item was mentioned, new symbols sprang to life at his touch, and melted into the basic function which expanded and changed.

Data scientists do the same. To get a good prediction, we need to include several aspects describing all the effects that might be correlated with the predicted event. To do so, numerous fields are merged from different data sources, and new derived fields are calculated. The accuracy of the prediction increases with the number of relevant descriptive variables.
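A rough sketch of this effect, assuming scikit-learn is available and using a synthetic dataset in place of real customer records: the same classifier is trained on an increasingly wide set of fields, and the test AUC tends to improve as relevant variables are added.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn table: 12 descriptive fields, most of them informative
X, y = make_classification(n_samples=5000, n_features=12, n_informative=10,
                           n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_fields in (2, 6, 12):  # train on the first 2, then 6, then all 12 fields
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:, :n_fields], y_train)
    scores = model.predict_proba(X_test[:, :n_fields])[:, 1]
    print(f"{n_fields:>2} fields -> test AUC = {roc_auc_score(y_test, scores):.3f}")
```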

4. The results of the predictions are given in percentages.

“It will end well; almost certainly so for the project; and with reasonable probability for you.” “What are the figures?” demanded Gaal. “For the project, over 99.9%.” “And for myself?” “I am instructed that this probability is 77.2%.” “Then I’ve got better than one chance in five of being sentenced to prison or to death.” “The last is under one per cent.”

“Indeed. Calculations upon one man mean nothing. You send Dr. Seldon to me.”

In classification models the results are usually given as percentages; for example, we see the probability of churn for every customer. We can define a cut-off value and compute a predicted class, but that class is only derived from the percentages, and working with the percentages themselves always carries more information. Confidence intervals can also be calculated for these probabilities, which is the next item on this list.
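A minimal sketch with invented customer scores and an arbitrary cut-off of 0.5: the hard class labels are derived from the probabilities, while the probabilities themselves remain available for ranking customers or for further analysis.

```python
# Hypothetical predicted churn probabilities for three customers
churn_probabilities = {"customer_a": 0.672, "customer_b": 0.081, "customer_c": 0.497}
CUT_OFF = 0.5  # an arbitrary threshold, chosen only for this example

for customer, p in churn_probabilities.items():
    predicted_class = "churner" if p >= CUT_OFF else "loyal"
    print(f"{customer}: P(churn) = {p:.1%} -> predicted class: {predicted_class}")
```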

5. Use confidence intervals.

Within another half year he would have been here and the odds would have been stupendously against us — 96.3 plus or minus 0.05% to be exact. We have spent considerable time analyzing the forces that stopped him.

This one is the most surprising. Confidence intervals were introduced by Jerzy Neyman in 1937! Using them in a novel published in the 1950s means that Asimov was really up to date on the latest achievements of statistics. This can be explained by his PhD in biochemistry, which he earned in 1948, and by his scientific career as a professor of biochemistry at Boston University School of Medicine.
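As a sketch of the idea, with invented counts rather than real data, the snippet below attaches a 95% confidence interval to an observed churn rate using the normal approximation for a proportion.

```python
# Invented counts: 1,344 churners observed among 2,000 customers
churners, customers = 1344, 2000
p_hat = churners / customers
# Standard error of the observed proportion
std_err = (p_hat * (1 - p_hat) / customers) ** 0.5
z = 1.96  # normal-approximation multiplier for ~95% coverage
low, high = p_hat - z * std_err, p_hat + z * std_err
print(f"estimated churn rate: {p_hat:.1%}  (95% CI: {low:.1%} to {high:.1%})")
```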

6. Predictions for individuals are much less reliable.

Seldon said, “I’ll be honest. I don’t know. It depends on the Chief Commissioner. I have studied him for years. I have tried to analyze his workings, but you know how risky it is to introduce the vagaries of an individual in the psychohistoric equations. Yet I have hopes.”

For example, in churn models we do the analysis at the customer level, but the prediction for a single customer is confusing: suppose that he or she has a 67% probability of churn. For non-mathematicians it is pretty hard to explain what that means. It is like Schrödinger’s cat: the customer is a churner with a probability of 67% and at the same time loyal with a probability of 33%. In this case we would foretell that the customer will churn, but there is a notable probability that they will remain loyal. For individuals the effect of fortune is large: there is always a chance of unpredictable events, like having an accident or winning the lottery, but the impact of these accidental events decreases significantly if we consider masses of people.
If we had 1,000 customers with exactly the same 67% probability of churn, we could state that around 670 of them (plus or minus a small percentage) will churn, but we don’t know exactly who those churners will be. The results are always more reliable at an aggregated level.
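A toy simulation of the 1,000-customer example above (the 67% churn probability is simply hard-coded as an assumption): each individual outcome is unpredictable, yet the total number of churners stays close to 670 across repeated runs.

```python
import random

random.seed(1)
P_CHURN, N_CUSTOMERS = 0.67, 1000  # the probability and customer count from the example

totals = []
for _ in range(20):  # simulate the whole customer base 20 times
    totals.append(sum(random.random() < P_CHURN for _ in range(N_CUSTOMERS)))

print("churner counts per run:", totals)
print(f"min = {min(totals)}, max = {max(totals)}, expected around {P_CHURN * N_CUSTOMERS:.0f}")
```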

7. Predictions for the near future are more accurate than predictions for the far future.

I am Hari Seldon! I do not know if anyone is here at all by mere sense-perception but that is unimportant. I have few fears as yet of a breakdown in the Plan. For the first three centuries the percentage probability of nondeviation is nine-four point two. […] Seldon is off his rocker. He’s got the wrong crisis. […] Then the Mule is an added feature, unprepared for in Seldon’s psychohistory.

We’ve been blinded by Seldon’s psychohistory, one of the first propositions of which is that the individual does not count, does not make history, and that complex social and economic factors override him, make a puppet out of him. […]

But the Mule is not a man, he is a mutant. Already, he had upset Seldon’s plan, and if you’ll stop to analyze the implications, it means that he — one man — one mutant — upset all of Seldon’s psychohistory.

Predictions are valid only under conditions similar to those of the training data. As time goes on, the probability of major changes occurring increases, so the reliability of the predictions decreases. In the novel, 300 years after Hari Seldon’s calculations were completed, a mutant turned up who was so powerful that he alone could change history. Seldon had no chance to foresee this centuries earlier, when there were no mutants.
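A toy illustration of this effect, with an invented process rather than any real forecasting model: when small unpredictable shocks accumulate step by step, the spread of possible outcomes, and with it the forecast uncertainty, widens as the horizon grows.

```python
import random
import statistics

random.seed(7)
N_PATHS = 2000
CHECKPOINTS = (10, 100, 300)  # forecast horizons in "years"

spread_at = {h: [] for h in CHECKPOINTS}
for _ in range(N_PATHS):
    state = 0.0
    for year in range(1, max(CHECKPOINTS) + 1):
        state += random.gauss(0, 1)  # a small unpredictable shock every year
        if year in spread_at:
            spread_at[year].append(state)

for horizon, values in spread_at.items():
    print(f"horizon {horizon:>3} years -> spread of outcomes (std. dev.) = {statistics.stdev(values):.1f}")
```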

The concept imagined by Asimov works in data mining today, just slightly differently: we use computers and statistics to predict the outcomes of elections and sporting events, climate change, and so on; but unfortunately circumstances are volatile, so we cannot make predictions hundreds of years in advance… yet!

Originally published at www.balabit.com on December 16, 2015, by Eszter Windhager-Pokol.
