Why it is difficult to find something, if you do not know what you are looking for

Published in Geek Culture, May 23, 2021

Plenty of data. Everywhere. They must be worth something.

Based on this simple assumption, a whole new discipline has grown out of nowhere in the blink of an eye. The famous “big data” approach initially filled magazines and eventually overflowed into the scientific pages of respectable journals, written by respectable authors. Machine learning techniques, i.e. mathematical tools based on a wise use of artificial neural networks, applied to the analysis of the large amounts of available data, have brought us stunningly efficient image and speech recognition tools, together with self-driving cars and the possibility of efficient automatic diagnosis of certain diseases. Big data and machine learning are nowadays at the basis of the most advanced Artificial Intelligence (AI) studies.

Such a successful approach soon prompted a greater ambition: to supersede the old-fashioned scientific method altogether. According to Chris Anderson, editor-in-chief of the American magazine Wired, who in 2008 wrote a piece entitled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, the great availability of “experimental” data lets us move beyond the traditional scientific approach, based on the creation of physical models (theories created by the minds of scientists and expressed as mathematical relationships), and fosters a new approach based, instead, on the massive use of AI techniques. The idea is to skip the search for an existing but hidden model and go directly to the identification of patterns in the data. In the words of Peter Norvig, director of research at Google and AI expert: “all models are wrong and more and more often you can find the solution without using them”. The message is clear: dear scientist, please stop trying to figure out how the world works; computers will find the laws of nature for us. These laws will not necessarily take the form of intelligible mathematical relationships, but they will work just as well.

If this is the case, why not apply it to the search for one of the most elusive results: gravitational waves? The first observation dates to September 14th, 2015 and came at the end of a thirty-year-long effort devoted to modelling, designing, building and operating large-scale optical interferometer detectors. I have been working for many years, together with dozens of other scientists, to model and reduce the noise affecting the functioning of the detector. Such a task is of strategic importance because the gravitational wave signal is literally hidden in an ocean of noise. The continuous improvement in noise characterization and attenuation finally led to the first detection. Since then a number of other signals have been detected, both from merging black holes and from coalescing binary neutron stars. The search for new signals is still on, and some scientists have now proposed to apply AI algorithms: “We have a lot of accumulated data and we plan to use machine learning techniques to find interesting things. We are looking for new signals that have escaped previous analyses. Maybe a new physics”. This is what Barry Barish, who is visiting me, tells me while we are at the restaurant. I reflect on his statement as we both look through the large window at the plain that separates Perugia from Assisi. In the background is the Convent of Saint Francis, illuminated for the night. I am puzzled by his statement and I tell him so, very cautiously, because after all he has already won the Nobel Prize and is not known as one who makes rash statements.

My colleague’s idea is intriguing: to date we have found evidence of gravitational waves because we knew what to expect and we went looking for it in an ocean of noise. Now let’s try to find something else in the same ocean of noise, something we do not know is there but that could be: gravitational signals emitted by new phenomena that we cannot even imagine and that could even reveal new laws of Physics. The idea of looking for something when we do not know whether it exists or exactly what it looks like is not really new. It has been at the heart of the work of the physicists and mathematicians who have dealt with so-called chaos theory since the eighties of the last century, among them the Dutchman Floris Takens and the Belgian-born David Ruelle. At that time they dealt with the problem of characterising a set of apparently random data: were they produced by some stochastic process (like thermal noise), or were they generated by some deterministic dynamical model that shows chaos?
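The question of chaos versus noise can be illustrated with a toy experiment. The sketch below is not the method of Ruelle and Takens; it is a simple nearest-neighbour forecasting test, and all function names and parameter values are invented for the illustration. It compares a series produced by the logistic map (a deterministic rule that shows chaos) with a series of random numbers. Both look erratic, but only the first lets us predict the next value from past moments when the series was close to its present value.

```python
import random

def logistic_series(n, x0=0.2, r=4.0):
    """Deterministic but chaotic series: x -> r * x * (1 - x)."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

def noise_series(n, seed=0):
    """Stochastic series: independent uniform random numbers in [0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

def nearest_neighbour_error(series):
    """Mean error when predicting each next value in the second half
    from the most similar past value found in the first half."""
    half = len(series) // 2
    train, test = series[:half], series[half:]
    # Pairs (value, value that followed it) taken from the first half.
    pairs = list(zip(train[:-1], train[1:]))
    total = 0.0
    for x, actual_next in zip(test[:-1], test[1:]):
        # Find the past value closest to x and reuse what followed it.
        _, predicted = min(pairs, key=lambda p: abs(p[0] - x))
        total += abs(predicted - actual_next)
    return total / (len(test) - 1)

chaos_error = nearest_neighbour_error(logistic_series(1000))
noise_error = nearest_neighbour_error(noise_series(1000))
print(f"prediction error, chaotic series: {chaos_error:.4f}")
print(f"prediction error, random series:  {noise_error:.4f}")
```

The chaotic series should give a much smaller prediction error than the random one, because a hidden rule connects each value to the next. Note that the test works only because we decided in advance what kind of regularity to look for.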

Nowadays, however, we are one step ahead of the program of Ruelle and Takens: as a matter of fact, the idea that circulates now is to skip the search for an existing but hidden model and go directly to the identification of patterns in the data. Is this a possible route to new discoveries? I do not know, and Barish and other colleagues are also very cautious. What bothers me is that the ambition behind this approach is far more general than looking for hidden signals. Are we really convinced that the availability of AI programs and large quantities of empirical data will make the traditional way of doing science, based on the work of scientists who create logical-mathematical models of the world, useless?

Some are not convinced and, alas, I am one of them.

There are several reasons why the “big data” program, at least in its most radical version, cannot work. Some of these reasons have been discussed in a beautiful work by Hykel Hosni and Angelo Vulpiani, entitled “Forecasting in light of big data” and published in the specialized journal “Philosophy & Technology”.

Besides these purely technical reasons, there is one that has deeper roots than the others, and this is the one we want to talk about here. To better understand what it consists of, we will use a literary aid: the famous story by Jorge Luis Borges entitled The Library of Babel. In this story he imagines the existence of a vast library composed of a huge number of hexagonal rooms, all the same, connected by corridors. As Borges explains: “Each wall of each hexagon corresponds to five shelves; each shelf contains thirty-two books of uniform format; each book is four hundred and ten pages; each page, of forty lines; each line, of some eighty letters of black colour.” As is soon discovered in the story, each book is composed of a random sequence of symbols (22 letters, plus the comma, the period and the space). The library contains all the possible combinations of these symbols and thus contains all the writable books that satisfy the length conditions expressed above. To use Borges’ poetic language, the library’s books describe everything: “the meticulous history of the future, the autobiographies of the archangels, the faithful catalogue of the Library, thousands and thousands of false catalogues, the demonstration of the falsity of these catalogues, the demonstration of the falsity of the authentic catalogue, the Gnostic gospel of Basilides, the commentary on this gospel, the commentary on the commentary on this gospel, the truthful account of your death, the translation of each book into all languages, the interpolations of each book in all the books.”
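To get a feeling for the size of the Library, its book count can be worked out from the format of the books. The following is a minimal sketch: the function name and its parameters are illustrative, and the line length is taken as eighty characters, the figure given in Borges's story (it is left as a parameter so that other formats can be tried). Since every book is a sequence of symbols of fixed length, the number of distinct books is the alphabet size raised to the number of characters per book. That number is far too large to print, so the code only counts its decimal digits.

```python
import math

def babel_book_count_digits(pages=410, lines_per_page=40,
                            chars_per_line=80, alphabet_size=25):
    """Return (characters per book, decimal digits in the number of books)."""
    chars_per_book = pages * lines_per_page * chars_per_line
    # Number of distinct books = alphabet_size ** chars_per_book.
    # Its digit count is floor(chars_per_book * log10(alphabet_size)) + 1.
    digits = math.floor(chars_per_book * math.log10(alphabet_size)) + 1
    return chars_per_book, digits

chars_per_book, digits = babel_book_count_digits()
print(f"{chars_per_book} characters per book")
print(f"the number of distinct books has {digits} decimal digits")
```

Under these assumptions the count is 25 raised to the power 1,312,000: a number with more than 1.8 million digits, enormous but still finite.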

The problem facing the visitor of the library is obviously to decipher the books because, as you can imagine, a book taken at random from the shelf appears as a sequence of symbols randomly assembled to compose meaningless words. It makes no sense to us, but perhaps it tells the story of our life, or a prophecy, or even the final equation of Physics, in another unknown and mysterious language.

A book from the Library of Babel, if you think about it, looks just like a series of data gathered from the experiment of my American colleague, provided that you assign a number to each symbol. It is apparently a sequence of random numbers, but you never know whether some promising signal is hidden within the ocean of noise. Therefore, looking for a new signal in the experimental data series would not be very different from looking for a sensible and interesting expression in one of the library’s books. And here comes the beauty: once this analogy is established (a book as a set of data), the Library of Babel really looks like the Big Data paradise. It contains all the information of potential interest to us; the problem is “just” extracting this information.

Coming to our aid is an Italian mathematician, Lucio Lombardo Radice, who in 1981 wrote a nice booklet entitled “L’Infinito”, of great interest for our problem. Lombardo Radice explains that, while the number of books in the Library of Babel is very large but still finite, the number of meanings that can be attributed to the content of those books is not. Technically this is called “Richard’s paradox” and is part of a family of results that led the logician and mathematician Gödel to formulate his famous incompleteness theorems. An infinite number of meanings potentially correspond to a finite string of characters or, in our context, a certain set of data constitutes the potential answer to an infinite number of physical questions. Without knowing the question, the answer risks being meaningless.

To better understand this point, let’s see an example. Suppose that in one of the books, in the middle of senseless strings of characters, there finally appears a sequence written in a familiar language: “while the music goes, Alice and Bob exchange secure messages through their entangled spins”. What does it mean? The two words “entangled spins” indicate two very different things depending on whether you interpret them using a nineteenth-century English dictionary or a late twentieth-century one. In the first case the phrase would indicate the confidences that two lovers exchange, perhaps whispering mouth to ear, during a turn of the waltz. In the second case, on the other hand, it would take on a completely different meaning, since twentieth-century physics, thanks to the creation of quantum mechanics, gave rise to the concept of “entangled spins”, which indicates two microscopic physical systems that have correlated properties and can be used for encrypted communications. Thus, to a contemporary reader, the phrase would sound like: while the music goes, Alice and Bob exchange secure messages through quantum cryptography. Which one of the two interpretations is the correct one? It depends on the mapping from words to meanings, that is, on the dictionary I use.
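The dependence of meaning on the dictionary can be made concrete with a few bytes of data. In the sketch below, which is only an illustration with variable names chosen for the example, the same four bytes are read with three different “dictionaries”: as text, as an integer and as a floating-point number.

```python
import struct

data = b"spin"  # four bytes: the "book" is the same in all three readings

as_text = data.decode("ascii")            # read as ASCII characters
as_integer = int.from_bytes(data, "big")  # read as a 32-bit big-endian integer
as_float = struct.unpack(">f", data)[0]   # read as a 32-bit IEEE 754 float

print(as_text)
print(as_integer)
print(as_float)
```

None of the three readings is the correct one in itself: the bytes do not say which dictionary to use.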

Metaphor aside, the construction of the scientific model (the dictionary in our example) is crucial for the interpretation of the experimental data. Without the laborious, complicated and “dirty” work of the scientist who, mixing intuition and induction, genius and creativity, manages to advance hypotheses that will then be refuted or confirmed by data, there is no production of true knowledge. To paraphrase the great French scientist Poincaré, we could say that the scientist must make order: science is made with data as a house is made with bricks, but an accumulation of data is no more science than a pile of bricks is a house.

In short: it is really difficult to find something new if you do not know what you are looking for, and learning to ask interesting questions is often much more useful than rummaging through abundantly available answers.

Thanks to my colleague Arjendu Pattanayak for useful discussions.

Written by Luca Gammaitoni