From “a*x + b” to the most powerful Transformer (in a 10-minute read 😅)

A necessary journey from the linear function to complex neural networks.

Published in etermax technology · 10 min read · Jun 23, 2022

By Ignacio Corres, Data Scientist at etermax

Introduction

The idea of this article is to make an orderly review of what it means to “model” an observed phenomenon with the aim of predicting its future behavior. In the context of our work at etermax, a “model” is a mathematical proposal that describes business dynamics and, due to its ability to predict, helps us make decisions.

A model, in its most abstract form, is a mapping that receives “X” and returns “Y”. The ideal model is one that returns a “Y” identical to the value that would be obtained by observing the phenomenon directly. In practice this does not exist, and the difference between the model’s prediction “Y” and the measured value is known as the error, which must also be expressed in mathematical terms. The idea is to find the model that produces the least error when predicting “Y”.

In short, the modeling process begins with the mathematization of the phenomenon, in which the observations become data, the phenomenon itself becomes the model, and the difficulty in predicting new values becomes the error. This triad of elements constitutes everything that the modeling of a phenomenon implies.

1. Linear Models

1.1 Linear Regression

The word linear in this case refers to the mathematical mapping that defines the model: the linear function “y = ax + b”. We will say that the relationship between the independent and dependent variables is linear. Mathematically, this can be summarized in the following expression (a generalization of the line to “n” independent variables):

Y = ω_0 + ω_1·X_1 + ω_2·X_2 + … + ω_n·X_n

In this way, the dependent variable “Y” is linked to the independent variables X_1 through X_n in such a way that the points that satisfy this relationship form a plane (a straight line in the one-dimensional case). The parameters ω_0 through ω_n determine the only properties necessary to fully define a linear function. From now on, for convenience, we will call them weights.

If we have a series of measurements of the variables X_1 through X_n and we want to estimate the value of the variable “Y” from the linear formula, what we must do is find the weights ω_0 through ω_n such that the error made in estimating the measured value Y_m is the minimum possible. This error can be calculated as (Y_m − Y_e)², where Y_e is the estimate produced by the model.

The sum of the errors made in each measurement will give us the total error made.

Image 1: Graph of a line on Cartesian axes modeling a series of three data points with their errors.

The most widely used algorithm for minimizing the error by finding the optimal weights is known as “gradient descent”. If we manage to do this and the linearity hypothesis is valid, the power of this process is that we can predict the value of a future “Y” by observing only the “X” variables.
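To make this concrete, below is a minimal sketch (illustrative only, not etermax’s production code) of gradient descent fitting the line “y = a·x + b” to synthetic data by minimizing the squared error; the data, learning rate and number of iterations are arbitrary choices.

```python
import numpy as np

# Illustrative noisy data generated around a known line (slope 2.5, intercept 1.0).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
Y = 2.5 * X + 1.0 + rng.normal(0, 1, size=100)

a, b = 0.0, 0.0   # the weights ω_1 (slope) and ω_0 (intercept), starting at zero
lr = 0.01         # learning rate

for _ in range(2000):
    Y_pred = a * X + b              # the model's prediction
    error = Y_pred - Y              # difference with the measured Y
    # Gradients of the mean squared error with respect to each weight.
    grad_a = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    a -= lr * grad_a                # step against the gradient
    b -= lr * grad_b

print(f"learned a ≈ {a:.2f}, b ≈ {b:.2f}")  # should end up close to 2.5 and 1.0
```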

To visualize its potential in a specific case, at etermax we use these models to, for example, estimate the optimal value at which to offer an advertising space (eCPM) in our products. For this, we have a series of data such as the values at which that space was sold in the past, the type of user device, etc. These constitute the “X”, while the eCPM, the quantity we want to estimate, represents “Y”. By finding the optimal weights, we find the relationship with the least error between these variables and the eCPM to be predicted. Thus, we can know the best value at which to offer the space.

What we have seen so far constitutes the basis of any modern predictive model. The fundamental concepts that we have addressed to define it are data, model, error and minimization.

1.2 Perceptron

The premise of this article is that, from what has been seen previously, known as linear regression [1], it is possible to understand the most complex neural networks. A neural network is a type of model that, with the idea of emulating the way in which the human brain processes information, combines several mathematical models with characteristics similar to those seen in the previous section into a single one capable of predicting and describing highly complex dynamics. The basic predecessor unit of neural networks, sometimes called an artificial neuron, is known as the perceptron [2].

Image 2: Diagram of the perceptron.

How does this relate to the linear model? It is simply the same model developed in the previous section, but with the difference that “Y” only takes two values, for example -1 and 1. Therefore, the model output must emulate this behavior. For this, the concept of a step function is defined (later it will evolve into the activation function). To develop it, we will now call what we used to call “y” by the letter “z”:

z = ω_0 + ω_1·X_1 + ω_2·X_2 + … + ω_n·X_n

The fundamental idea is that “z” is now not the final output of the model but rather an intermediate input to another function, the step function σ(z), such that

σ(z) = 1 if z ≥ 0, and σ(z) = -1 if z < 0

As always, the whole problem reduces to finding the group of weights ω_0 up to ω_n so that the error in predicting y is the minimum possible. The latter is always the crux of the matter, from the simplest model to the most complex one.

Image 3: Graph of a plane in Cartesian axes modeling a classification problem.

Finding the optimal parameters means, geometrically, finding the plane that best separates data of type -1 from data of type 1.
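As an illustration, here is a minimal sketch of a perceptron that outputs -1 or 1 on toy two-dimensional data that is linearly separable by construction; the weights are updated with the classic perceptron rule, and all names and values are illustrative.

```python
import numpy as np

def step(z):
    """Step function σ(z): returns 1 if z >= 0, otherwise -1."""
    return np.where(z >= 0, 1, -1)

# Toy data: 200 points in the plane, labeled by a known separating plane.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.zeros(2)   # weights ω_1, ω_2
w0 = 0.0          # bias ω_0
for _ in range(20):                  # a few passes over the data
    for xi, yi in zip(X, y):
        if step(w @ xi + w0) != yi:  # update the weights only on mistakes
            w += yi * xi
            w0 += yi

accuracy = np.mean(step(X @ w + w0) == y)
print(f"training accuracy: {accuracy:.2f}")
```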

At etermax we use models of this binary type to, for example, determine whether an active user is going to play again or not (the problem has a binary answer). Based on a series of measurable data such as the time the user spends in the application, the rate of correct answers in trivia, etc., it is possible to determine with some accuracy whether that user will use our products again or not (that is, whether they fall on one side or the other of the plane). These are known as “churn” problems.

The success or failure of such modeling will depend on whether the dependent variable “Y” maintains some linear relationship with the independent variables (whether it is a regression or a classification problem). In real forecasting problems, this linearity assumption rarely holds. To address this issue mathematically, the concept of an activation function will be important.

2. Nonlinear Models

2.1 Neural Networks

Neural networks [3] are the concatenation or grouping of perceptrons, but with a fundamental difference: non-linearity. This is achieved by replacing the step function with another, non-linear function known as the activation function. Often, functions whose image lies between fixed values such as 0 and 1 are used, and any intermediate value can be taken. This is the case of the sigmoid function:

sigmoid(z) = 1 / (1 + e^(-z))

Image 4: Graph of a sigmoid function on Cartesian axes.

There is a very wide variety of nonlinear activation functions, but for illustrative purposes, in this article we will always use the sigmoid function.
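For reference, a short sketch of the sigmoid in code (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ≈ [0.0067, 0.5, 0.9933]
```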

Now, the idea of this type of neural model is that it is made up of several layers, where each layer consists of many neurons (the modern, non-linear version of the perceptron).

Image 5: Abstract diagram of a fully connected two-layer neural network.

In this way, the output of a neuron is calculated as

h = sigmoid(ω_0 + ω_1·x_1 + … + ω_n·x_n)

Then, we can define a weight matrix for the layer:

W = ( ω_1,0  ω_1,1  …  ω_1,n
      ω_2,0  ω_2,1  …  ω_2,n
      …
      ω_r,0  ω_r,1  …  ω_r,n )

Where each row contains the weights of one of the r neurons. Thus, the total output of a layer “L” of neurons will be a vector of as many numbers as there are neurons in the layer: H = sigmoid(W·input). The neurons of the next layer L+1 no longer receive “X” as input, but rather the outputs of the previous layer, H = (h_1, …, h_r), and so on until the final layer, which returns the prediction “y”. Although in this case it is necessary to optimize up to hundreds of millions of weights (in linear regression with a single independent variable there are only two) in a highly non-linear dynamic, the basic idea is exactly the same: find the weights of the model that minimize the error made in predicting the dependent variable. This type of architecture has great potential to describe complex problems due to its high non-linearity, with the ability to describe phenomena whose variables are related in a non-intuitive way.
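To fix ideas, here is a minimal sketch of the forward pass of a fully connected two-layer network with sigmoid activations; the layer sizes and random weights are illustrative assumptions, and no training is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)

x = rng.normal(size=4)            # input vector (X_1 ... X_4)

W1 = rng.normal(size=(3, 4))      # layer L: 3 neurons, each with 4 weights
b1 = rng.normal(size=3)           # the ω_0 term of each neuron
h = sigmoid(W1 @ x + b1)          # H = sigmoid(W·input), one value per neuron

W2 = rng.normal(size=(1, 3))      # layer L+1 receives H, not X
b2 = rng.normal(size=1)
y = sigmoid(W2 @ h + b2)          # final prediction

print("hidden layer H:", h)
print("prediction y:", y)
```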

2.2 Transformers

Transformers [4] are a type of modern neural network with high predictive capacity, used fundamentally in language processing. Many automatic translators, algorithms that suggest completion of sentences and answers, and automatic text generators, among other models that are widely used today, use this type of technology.

With what has been seen so far, it is possible to understand the general functioning of these models, which, beyond a slightly different logic, rely on the same mathematical machinery.

This model is characterized by a type of neuron layer called the “self-attention layer”. These layers are able to extract significant information of a contextual nature: not only relationships between two or more different variables, but also between two or more different measurements of the same variables. To achieve this, three different weight matrices are defined in each layer (in the architectures seen so far, each layer had only one associated weight matrix). These matrices are called “WQ” (queries), “WK” (keys) and “WV” (values), and they act by defining the following vectors:

qᵢ = WQ·xᵢ,   kᵢ = WK·xᵢ,   vᵢ = WV·xᵢ

In this way we obtain vectors qᵢ, kᵢ and vᵢ for each element in the input sequence. The product qᵢ·k_j is interpreted as the contextual relationship between element i and element j within the sequence. Then, for each element i, a unique vector zᵢ is assembled as a weighted sum of the values:

zᵢ = Σ_j softmax_j(qᵢ·k_j) · v_j

We can distinguish in this expression the same type of operation that we have been using since the simplest linear model. When we refer to the “input sequence”, in this context, it can be a time series or, more generally, any data set with a certain order, for example sentences or texts.
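The following minimal sketch shows a single self-attention layer acting on a sequence of n input vectors; the dimensions and random weights are illustrative, and it includes the usual scaling of the products by √d, a detail not discussed above.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
n, d = 5, 8                      # 5 elements in the sequence, vectors of size 8

X = rng.normal(size=(n, d))      # input sequence (e.g. word embeddings)
WQ = rng.normal(size=(d, d))     # query weights
WK = rng.normal(size=(d, d))     # key weights
WV = rng.normal(size=(d, d))     # value weights

Q = X @ WQ                       # one q_i per element
K = X @ WK                       # one k_i per element
V = X @ WV                       # one v_i per element

scores = Q @ K.T / np.sqrt(d)    # q_i · k_j: contextual relation between i and j
weights = softmax(scores)        # normalized coefficients, one row per element i
Z = weights @ V                  # z_i: weighted sum of the values v_j

print(Z.shape)                   # (5, 8): one context-aware vector per element
```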

In order to understand this better, let’s illustrate it with an etermax use case.

Let’s imagine that we need to moderate the content created by users to detect offensive content. If we only filtered offensive words, it would be enough to define a list of them and filter out the content that contains them. However, on many occasions, the offensive character of a sentence also depends on its overall form, context, the way the words are used, etc. In this sense, it becomes essential that our model can “learn” contextual information. In Image 6 we can see a real example of how our models work. While the “blacklist” model only compares words against a list, the Transformer learns both the individual words and the contextual information, and calculates the probability of the content being offensive.

Image 6: Real example of how our models work. If the word “kill” is not on the blacklist, a filter that acts only by looking at words would never detect possibly offensive content that includes it. Our Transformers do detect it, depending on its contextual use.

As we have seen so far, any information that is to be used as input for this type of model must admit a mathematical representation. If we want our input to be, for example, a fraction of text or a sentence, how do we represent it mathematically? Here the concept of embedding arises. Although we will not go into detail about what embeddings are and how they are obtained, it is enough to say that in Natural Language Processing they are mathematical representations of a vocabulary.

In traditional neural networks, generally, the embeddings representing each word enter one at a time. The power of Transformers is that they receive entire sentences at once. In this case, the significant relationships between different elements of the dataset that Transformers can compute are interpreted as significant relationships between different words. Suppose we have the sentence “The dog is white and the cat is sleeping”. Can the neural network interpret that “white” refers to “dog” and “sleeping” to “cat”? The answer is yes, and the answer to how is, as usual, by finding the right weights. In this case, the weights of the self-attention layers:

Image 7: Abstract diagram of the operation of a “self-attention layer”.

Finding the optimal weights in “WQ” , “WK” and “WV” is what will allow the network to adequately detect these contextual relationships.
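As a complement, here is a toy sketch of how a sentence like the one above could be turned into the numeric matrix a self-attention layer receives; the vocabulary, embedding dimension and random values are purely illustrative assumptions.

```python
import numpy as np

# Made-up vocabulary and a random embedding table: one 8-dimensional vector per word.
vocab = {"the": 0, "dog": 1, "is": 2, "white": 3, "and": 4, "cat": 5, "sleeping": 6}
rng = np.random.default_rng(4)
embeddings = rng.normal(size=(len(vocab), 8))

sentence = "the dog is white and the cat is sleeping".split()
X = np.stack([embeddings[vocab[w]] for w in sentence])

print(X.shape)  # (9, 8): the whole sentence enters the self-attention layer at once
```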

For the moment, that’s all we will share about Transformer models. The idea of this section is to show, synthetically, that the same functional form, an activation applied to (W·X), is repeated in learning models from the simplest linear models to the most complex neural architectures. However, the moral of this article is not that the mathematical complexity of the different models is the same. The increase in mathematical complexity and, consequently, in the computational requirements of the calculations falls fundamentally on the algorithms that minimize the error function. In the case of neural networks, these calculations can be very complex and expensive. To search for information about this type of algorithm, the keywords “Gradient Descent” [5] and “Backpropagation” [6] may be useful.
