Knowledge as a list of numbers

Miquel Duran-Frigola
Published in ersiliaio · 6 min read · May 30, 2022


If there is one thing I have been obsessed with as a researcher, it is knowledge representation, or the art of converting the information we have about the real world into something a computer can manage (and even “understand”): knowledge expressed as a set of numbers, arranged in a particular way, that an algorithm can process, operate on, store and modify.

This first paragraph was vague. Consider language. There is so much information about the world in language. If you write the word “justice”, for instance, your mind will immediately recall other words, like “punishment”, “crime”, “victim”, “right”, “wrong”, “innocent” or “court”. In turn, these words are likely to trigger more elaborate thoughts: perhaps an opinion on a recent news story, perhaps a personal experience, perhaps a good book you read in your youth (Kafka’s The Trial, I’m guessing). And all of this will be framed in a particular definition of “justice” that is bound to a certain point in history and to a certain region of the world; in my case, the southern Europe of the 21st century. If you think about it, the knowledge embedded within a single word is almost unfathomable.

“Embedding” is actually a very specific concept in data science. In its simplest form, an embedding is a list of numbers that characterise a given entity of the real world. Let’s stay with the word “justice”. We can assign to the word “justice” a set of, say, one hundred numbers. I am making it up: 0.86, 0.32, -0.56, 1.02, and so on. This list of numbers is called an embedding vector and doesn’t mean anything to us humans, but it can mean a lot to a computer. Now let’s take the word “crime”. As we know, there is a semantic connection between “justice” and “crime”. The question is how to make the computer aware of this link. The straightforward thing to do is to assign to “crime” a list of numbers that is similar to the list corresponding to “justice”. I am making it up again: 0.81, 0.28, -0.43, 1.15, and so on. Now the computer can measure the difference between these numbers, and by realising the difference is small, it will know that there is a connection between “justice” and “crime”. The word “apple” has nothing to do with “justice”, so we would assign a very different embedding to it: for instance, -1.30, 0.01, -0.85, 0.21, and so on.
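
To make this concrete, here is a minimal sketch in Python of how a computer might compare embeddings, using a common measure called cosine similarity. The vectors are just the made-up numbers above, truncated to four dimensions for readability; real embeddings would have a hundred or more.

```python
import numpy as np

# Toy embeddings built from the made-up numbers above. Only the
# first four of the hypothetical one hundred dimensions are shown.
embeddings = {
    "justice": np.array([0.86, 0.32, -0.56, 1.02]),
    "crime":   np.array([0.81, 0.28, -0.43, 1.15]),
    "apple":   np.array([-1.30, 0.01, -0.85, 0.21]),
}

def cosine_similarity(a, b):
    """Measure how aligned two vectors are: close to 1 means very similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["justice"], embeddings["crime"]))  # ~0.99, very close
print(cosine_similarity(embeddings["justice"], embeddings["apple"]))  # ~-0.18, unrelated
```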

As you can imagine, things can easily get complicated. The word “right” is indeed related to the word “justice”, but it has other meanings too, depending on the context (“my right hand”, “the right choice”, “right in the middle”, etc.). The word “wrong” is semantically linked to the word “right”, but it has the exact opposite meaning. Also, when we connect “right” to “justice”, are we talking about “right” as in “lack of wrongness”, or “right” as in “human right”? And now that I think about it, not even the word “apple” is completely unrelated to “justice”: I can get there through “gravity” and then “law”. All of these nuances need to be captured by the embedding vectors, which is why these knowledge abstraction artefacts are impossible to craft by hand. What is more, one can argue that it is a sentence, not a word, that we actually want to embed in the form of a vector, and from there it follows that it is a paragraph, not a sentence, that bears meaning in language. It doesn’t really matter; the essence remains: as far as the computer is concerned, a linguistic entity is a mere list of numbers. Knowledge and reasoning are achieved by comparing lists of numbers.

From a data science perspective, the question is how to obtain embedding vectors for any given word, sentence or paragraph automatically (without a human with encyclopaedic knowledge and extraordinary algebraic abilities behind the scenes). The approach of data science to this problem, and to any other problem for that matter, is to look for existing evidence in the real world (i.e. data) and come up with a way to extract useful numerical tools (e.g. embeddings) from it. In the case of language, the existing evidence is the formidable corpus of written text available in the form of books, articles, documents, social media posts, etc. The way embeddings can be derived from this corpus is quite simple, if you leverage one property that language has: words are written in a sequence, and words that are proximal in the sequence are (somewhat) related to one another. In a document, we are likely to find the word “victim” near the word “justice”, and this will happen many times, in many texts. We are now very close to defining the word embedding task formally: words that co-occur in text vicinities need to have similar embedding vectors.
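
The text does not commit to a particular algorithm, but this co-occurrence recipe is essentially what word2vec-style models implement. Here is a minimal sketch with the gensim library; the toy corpus and the parameter values are illustrative, not anything prescribed above.

```python
from gensim.models import Word2Vec

# A made-up toy corpus: each document is a tokenised list of words.
# Real models are trained on millions of documents.
corpus = [
    ["the", "court", "found", "the", "defendant", "guilty", "of", "the", "crime"],
    ["justice", "requires", "that", "the", "victim", "be", "heard", "in", "court"],
    ["the", "innocent", "man", "was", "cleared", "of", "any", "crime"],
    ["she", "ate", "an", "apple", "while", "reading", "in", "the", "garden"],
]

# window=5: two words count as co-occurring if they appear within five
# positions of each other; vector_size is the length of each embedding.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=50)

vector = model.wv["justice"]  # the one-hundred-number embedding for "justice"

# On a corpus this tiny the values are noisy; trained at scale,
# justice-crime would score clearly higher than justice-apple.
print(model.wv.similarity("justice", "crime"))
print(model.wv.similarity("justice", "apple"))
```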

Language is not the only tool we have to capture information about the real world. Images are another, and the embedding approach can be applied to them just as it is applied to words. If we screen a corpus of images, most likely we will find millions of eyes, and eyes will occur in pairs, symmetrically above noses, which in turn will relate to mouths and ears to form faces. These spatial relationships can be captured numerically with embedding vectors. I often think of embeddings as some sort of abstract painting, a representation that actually goes beyond the figurative version of the image, since it puts the image in a numerical dialogue with the rest of the images in the corpus.
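
One common way to obtain such image embeddings, sketched below under assumptions of my own (the model choice, the preprocessing and the file name are illustrative), is to take a network pretrained on a large image corpus and read out its last internal layer as the embedding.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Take a network pretrained on a large image corpus (ImageNet) and
# drop its final classification layer, keeping the 512 internal features.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
embedder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("portrait.jpg").convert("RGB")  # hypothetical input file
with torch.no_grad():
    embedding = embedder(preprocess(image).unsqueeze(0)).flatten()
print(embedding.shape)  # torch.Size([512]): the image as a list of numbers
```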

Abstract painting by Monika Kus-Picco, photographed in the Albertina Museum in Vienna. Pigments are produced from medicines (chemotherapy drugs, antibiotics, multivitamin preparations) that have passed their expiration date.

In biomedicine, my domain of interest, knowledge is often expressed as networks connecting distinct biological entities such as genes, proteins, metabolites, cells or diseases. In these networks, nodes are connected by edges. When we observe in the lab that the anticancer drug imatinib interacts with its target protein (a kinase), we can add an edge between the “imatinib” node and the “kinase” node, and another between the “kinase” node and the “cancer” node, since targeting the kinase results in killing cancer cells. A proximity thus exists between these three entities, which can again be expressed numerically by giving similar sets of numbers (similar embeddings) to the drug imatinib, its target kinase, and cancer.
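
This is not the pipeline behind the resources discussed below, but a popular family of methods (node2vec and its relatives) illustrates the idea: random walks over the network play the role of sentences, and the word-embedding machinery from before is reused unchanged. A toy sketch with networkx and gensim, with a second, hypothetical drug branch added purely for contrast:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# A toy network with the edges described in the text, plus a
# hypothetical aspirin branch so that "distant" nodes exist.
graph = nx.Graph()
graph.add_edges_from([
    ("imatinib", "kinase"),
    ("kinase", "cancer"),
    ("aspirin", "cox2"),
    ("cox2", "inflammation"),
])

def random_walk(g, start, length=10):
    """Produce a 'sentence' of nodes by walking randomly along edges."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return walk

# Many walks from every node play the role of the text corpus.
walks = [random_walk(graph, node) for node in graph.nodes for _ in range(100)]

model = Word2Vec(walks, vector_size=32, window=3, min_count=1, epochs=10)
print(model.wv.similarity("imatinib", "cancer"))        # proximal in the network
print(model.wv.similarity("imatinib", "inflammation"))  # distant, low similarity
```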

Biology is an extremely complex domain of knowledge, arguably more complex than language itself. Biological networks can have hundreds of thousands of nodes and tens of millions of edges, with each edge corresponding to a specific data point, a specific experiment carried out in the lab. I find the embeddings strategy very enticing as it can, in principle, encapsulate in a very succinct format all of the scientific wisdom that researchers have been gathering and publishing over the years in specialised journals and databases. A few years ago, at the IRB Barcelona, we started to build a gigantic biological network. The first attempt to extract numerical vectors from this network resulted in the Chemical Checker, a collection of embeddings related exclusively to small (drug) molecules, with the aim of capturing biochemical properties as well as therapeutic indications and adverse side effects. Now we are releasing the Bioteque, a similar resource of much broader scope, containing embeddings for genes, cells, organs, diseases and drugs, based on a biological network of 70 million edges collected from over 200 databases. The Bioteque paper was led by my dear colleagues Adrià Fernández-Torras and Patrick Aloy. I am hopeful that this resource will offer a common framework for biologists and data scientists to communicate, another chance for computers to “understand” what goes on in our bodies.
