Transformers in machine learning: where are they heading?

Pavel Malyshkin
4 min read · Mar 20, 2022


In 2017, Google engineers published Attention Is All You Need [1], in which they proposed a new neural network architecture called the Transformer and evaluated its performance on translating natural-language texts. In February 2022, this technology was used in AlphaCode [2], a system that learns to solve competition-level programming problems.

Programming has long seemed an unattainable task for artificial intelligence, but now this milestone has been reached. Does this mean the Transformer model is that very artificial intelligence, able to replace programmers and other knowledge workers soon, or is it just a step in the development of technologies that can assist humans in their work but are in no way able to replace them?

To answer this question, let's consider the two main models of machine translation. One involves the concept of meaning: the words of the text are mapped onto some universal language of semantic entities, these meanings are then rendered as words of another language, and the translated text is formed from them. This model dominated in the 1970s and 1980s, and attempts to automate it continued into the 1990s. The idea behind the other model is that understanding the meaning of a text is generally not necessary to translate it. Tools for machine-aided translation built on this model in the 2000s match whole chunks of text in one language to chunks of text in another.
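To make the second model concrete, here is a minimal sketch of chunk-based matching in Python. The phrase table, the example tokens, and the greedy longest-match strategy are illustrative assumptions of mine, not a description of any real translation tool; real systems learn millions of weighted phrase pairs from aligned corpora.

```python
# A toy phrase table: chunks of English mapped to chunks of German.
# Hand-written here purely for illustration.
PHRASE_TABLE = {
    ("the", "machine"): ["die", "Maschine"],
    ("is", "running"): ["läuft"],
    ("the",): ["die"],
    ("machine",): ["Maschine"],
}

def translate(tokens):
    """Greedy longest-match phrase lookup: matching chunks of one
    language to chunks of another, with no notion of 'meaning'."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for n in range(len(tokens) - i, 0, -1):
            chunk = tuple(tokens[i:i + n])
            if chunk in PHRASE_TABLE:
                out += PHRASE_TABLE[chunk]
                i += n
                break
        else:
            out.append(tokens[i])  # unknown word: pass it through
            i += 1
    return out

print(translate(["the", "machine", "is", "running"]))
# ['die', 'Maschine', 'läuft']
```

Note that nothing in this lookup involves any representation of what the sentence means; it only matches surface chunks, which is precisely the point of the second model.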

Is operating with meanings obligatory for high-quality translation? Suppose two languages both allow describing, for example, a certain class of machines. The task of translation is to produce a description of a machine in one language from its description in the other. Practice shows that such translation does not require invoking the concept of meaning (the very idea of machine-aided translation came straight from the field of technical specifications).

However, not all texts written in these languages will be descriptions of machines. Not all texts describing machines will correspond to working machines, in the sense of being able to perform some function in computable time (by performing a function one may understand, say, that the machine can be matched with another machine performing the inverse function). And some of the working machines described will be able to perform their functions only for a limited range of input conditions.

Therefore, having such remarkable machine-description languages, it is tempting to use them for other purposes, for example, to design new machines. That is, we describe a new machine by writing a text and want to know, before implementing it in hardware: is it operable? If so, can it perform a useful function? And if so, under what conditions and restrictions?

A translator does not necessarily appeal to the meaning of what is written; it does not matter to the translator whether the text describes something workable and useful, and the translation will be produced in any case. That automatic translation need not appeal to meaning is neither a strength nor a weakness of automatic translation but a feature of the task itself. The tasks of evaluating machines by their descriptions, however, require significantly greater intellectual effort, and perhaps even a qualitatively different intelligence.

Construction by description is, in a sense, also a process of translation: from the language of instructions and specifications into the language of hardware. But humans can imagine a machine and how it works before they have constructed and tried it; that is, they can answer some questions in advance about the operability and functions of a hypothetical machine that exists only in words. This is where the meaning of words is demanded: conceivable entities that can interact with each other without being physically embodied.

Is the Transformer model a step towards using meaning in machine learning, or a step towards not using it? So far, one cannot say unambiguously. We can only note that the choice of translation as an application points to areas where meaning is not necessary, and that the training procedure and the interface between the learning automaton and its environment show no evidence of meanings being used, even if they could in principle be computed inside the Transformer model.

The fact that the output is a program (an entity that can be executed on a physically existing device and perform a useful function) does not mean that the machine appeals to the meaning of the program it writes: it has no way to execute the code, no way to see the bugs that emerge, and accordingly no way to learn from those errors. It operates with words and phrases, but not with the meanings of those words and phrases.

From the results of the AlphaCode experiment, one can only conclude that there are tasks in programming that do not require operating with meanings, and we will probably soon see automatic tools that learn to port code between programming languages and platforms.

Still, the authors of the Transformer model say they see signs that “attention heads” (specific elements of their neural network design; a minimal sketch of one is given after this paragraph) exhibit behavior related to the syntactic and semantic structure of sentences. Is this an ability to work with meanings? Follow us to see our analysis of the “Attention Is All You Need” paper, where we explain in simple terms what was done and how it relates to other work.
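For reference, here is a minimal sketch of a single attention head (scaled dot-product attention, as defined in [1]) in Python with NumPy; the toy dimensions and random weights are illustrative only:

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """A single scaled dot-product attention head.

    X          : (seq_len, d_model) array of token embeddings.
    Wq, Wk, Wv : learned (d_model, d_head) projection matrices.
    """
    Q = X @ Wq                               # what each token looks for
    K = X @ Wk                               # what each token offers
    V = X @ Wv                               # the content passed along
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ V  # each token becomes a weighted mix of the others

# Toy usage: 5 tokens, embedding size 16, head size 8 (sizes arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(attention_head(X, Wq, Wk, Wv).shape)  # (5, 8)
```

The weight matrix computed inside the head says, for every token, how strongly it attends to every other token; it is these attention patterns that the authors observed aligning with syntactic and semantic structure.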

  1. Ashish Vaswani et al. Attention Is All You Need. 2017.
  2. Yujia Li et al. Competition-Level Code Generation with AlphaCode. 2022.
