Machine Learning Zuihitsu — II

Eren Ünlü
Published in Nerd For Tech
7 min read · Dec 28, 2020

The Most Archaic Self-Attention and Multiplicative Domain

Dr. Eren Unlu, Data Scientist and Machine Learning Engineer @Datategy, Paris

Hello folks. Today I want to share my experiences with two distinct but highly related techniques that I have been using heavily since the beginning of my ML career. I am going to present these extremely simple techniques, which can prove to be crucial, game-changing touches on certain types of tasks and datasets. These two simple arithmetic concepts are explained under two subtitles in this article: “The Most Archaic Self-Attention” and “Multiplicative Domain”. I did not come up with these ideas through an extensive literature survey, nor from the advice of wiser ML gurus in my entourage; they are such simple techniques that one might not bother to put them in a textbook, or even mention them to a fellow data scientist. I have gradually built these mindsets through years of getting my hands dirty on hundreds of various datasets. Though they might seem obvious, I am sure there are certain data science / machine learning newcomers who can gain some insight. So, if you are interested, let’s dive in!

The Most Archaic Self-Attention

If you think about it in depth, the current science of machine learning is almost entirely 2D tabular data processing. In an RNN (LSTM/GRU or vanilla), we extract the hidden states (sometimes only the last one), which form a feature vector at the end of the day, and feed it to dense neural layers for classification or regression tasks. In a CNN, we convolve kernels over the input to produce a feature vector (where the weights of the convolutional kernels are also learned through end-to-end backpropagation), and perform our task. Even in regular temporal tasks that do not use RNNs, we first “tabularize” the temporal data through lagging. The same story holds for various forms of clustering, and so on. Of course, although there are certain algorithms which do not fit into this 2D tabular framework, we can still safely say that the modern history of ML is the history of tabular data processing.

Why do I mention all this? Well, ironically, I observe a trend in data science / machine learning / artificial intelligence research where interest in tabular data is starting to fade. It is just not sexy anymore. New cutting-edge publications focus on much more specific tasks such as NLP or time-series analysis (where we are not yet satisfied with the results and seek imminent improvement). This loss of interest has two perils in my opinion: 1) Many ML tasks are inherently tabular, or can be tabularized if we want. So 2D tabular data is the core of this science; it is somewhat like linear algebra, one level up. Any improvement in this domain has further potential for every other higher-level ML and AI task. 2) Tabular data still constitutes the most frequent form of data in the market. And I feel like, after CatBoost and XGBoost, we have switched to a mode of bragging “2D tabular data: mission accomplished! Next!”. But we all know this is not the case.

Anyway, the historical evolution of ML generally goes from tabular to higher-level tasks. But in my opinion the reverse is also possible, and can yield significant improvements. We can also apply tricks and techniques that we have developed specifically for, say, NLP to boost a tabular data processing algorithm.

For example, I have seen that Google’s TabNet tries to achieve this, though I have not yet found the time to investigate it further [1]. It brings the self-attention concept to 2D tabular data classification. Apart from the technique explained in that paper, let us keep it simple. What is the simplest possible form of “self-attention” for tabular data?

Well, it is just the (per-datapoint instance of the) “covariance matrix” of the input vector: the outer product of a datapoint’s feature vector with its own transpose. Before the concept of “attention”, I remember the days when I was checking the correlation matrix of a whole dataset for inspection, and I said to myself: “What if I just feed the flattened covariance matrix (instance) of each datapoint as auxiliary data to the regressor or classifier?” I even remember telling myself, “this should not bring many advantages and shall be redundant”. However, now that we have all become familiar with the concept of self-attention, it makes a little more sense: I was just trying to apply the most basic form of self-attention.

In the general case, it is indeed redundant. When you think about it, if your model is a capable one and you train it well on your whole dataset, it should inherently capture this effect, so feeding your model auxiliary covariant features is unnecessary. (The covariance matrix is nothing but the pairwise products of features. A neural network, which can approximate absurdly complex nonlinear functions per the universal approximation theorem, should incorporate this small detail by itself, right?) But in certain specific cases, with huge and skewed datasets, this redundancy disappears, and you observe that this extra covariant information can actually boost performance significantly. I have personally gained an advantage with this technique on certain types of tasks, and I do recommend giving it a try when you are stuck with a stubborn dataset.

Especially if you feed the actual features and the covariant features through parallel layers, this is nothing but an implementation of dot-product self-attention for tabular data, where we obviously do not need positional encoding due to the lack of sequential order. Nor are we forced to feed the whole self-product (let’s call it the covariance instance), which measures the degree of relationship between the features of a single datapoint; we can instead use certain characteristic scalars or vectors, such as its determinant or eigenvalues.
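To make this concrete, here is a minimal numpy sketch of the idea (the function name and setup are mine, purely illustrative): each datapoint’s flattened self-outer-product is appended as auxiliary columns. Note that this adds d² columns for d features, so it suits modest feature counts.

```python
import numpy as np

def add_covariance_instance(X):
    """Append each datapoint's flattened self-outer-product
    (its 'covariance instance') as auxiliary features."""
    # Row-wise outer products: shape (n_samples, d, d)
    outer = np.einsum('ni,nj->nij', X, X)
    # Flatten each d x d matrix and concatenate with the raw features
    return np.hstack([X, outer.reshape(X.shape[0], -1)])

# Hypothetical usage: feed X_aug to any downstream classifier/regressor,
# or through a parallel input branch of a neural network.
X = np.random.rand(100, 5)
X_aug = add_covariance_instance(X)  # shape: (100, 5 + 25)
```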

Multiplicative Domain

The good old universal approximation theorem tells us that a wide and/or deep multilayer perceptron (MLP), equipped with nonlinear activation functions, is capable of approximating any reasonable function on earth. (Without the nonlinearity, a stack of linear layers collapses into a single linear map, so the activations are essential.) You don’t need to dive into complex equations and inequalities to grasp the idea; think of it in pieces. Any nonlinear function is ultimately built from the four universal arithmetic operators: addition, subtraction, multiplication and division, right? Subtraction is a form of addition, and division is a form of multiplication. A perceptron is nothing but a weighted summation, so it already implements summation at the most basic level, and multiplication by a constant comes for free: multiplying by N is nothing but summing N times. Voila! But note that, in terms of input features, what we actually want to capture are their permutative multiplications, products between the features themselves, which is much, much harder. We cannot simply assume that a deep MLP decorated with nonlinear activations, or a tree algorithm, will extract these effects!
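Here is a tiny numpy check of this point (my own illustrative toy, not from the original post): an additive least-squares fit on x1 and x2 cannot represent y = x1·x2 at all, while adding the product as an explicit column makes the fit exact.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = X[:, 0] * X[:, 1]  # a purely multiplicative target

# Best additive (linear) fit: it cannot represent the product at all.
w_add, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_add = np.mean((X @ w_add - y) ** 2)   # ~0.11, no better than predicting 0

# Append the product x1*x2 as an explicit feature: the fit becomes exact.
X_mult = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
w_mult, *_ = np.linalg.lstsq(X_mult, y, rcond=None)
mse_mult = np.mean((X_mult @ w_mult - y) ** 2)  # ~0.0
```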

Sure, they will, but at the expense of more complexity, which in turn hurts generalization and human interpretability. In certain cases, good old linear regression, especially with an elastic net penalty, is more than enough if you feed it extra auxiliary features, such as polynomials of the features (which are a form of permutative multiplication of features). As I mentioned before, multiplicative relations between features are extremely prominent in nature. So, as a humble piece of advice: do not ever underestimate the potential of inspecting the multiplicative relations among features. I personally call this taking the data into the “multiplicative domain”.
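A minimal scikit-learn sketch of this recipe might look like the following (assuming scikit-learn is available; the hyperparameter values are illustrative, not tuned):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import ElasticNet

# Degree-2 polynomial features add every pairwise product x_i * x_j
# (the 'permutative multiplications' above); the elastic net penalty
# then shrinks or discards the interactions that carry no signal.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    ElasticNet(alpha=0.1, l1_ratio=0.5),  # illustrative, untuned values
)
# model.fit(X_train, y_train)  # X_train, y_train: your tabular data
```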

As you can see, this permutative multiplication of features is almost identical to the covariance-matrix instance, i.e. the archaic attention above: the outer product of a datapoint’s feature vector with its own transpose. The two concepts are interchangeable, and interpretable from two distinct perspectives.

Another way to approach this issue is to construct a linear system where the logarithms of the features are fed as auxiliary inputs, and the exponential of the output is taken. From our high-school algebra, log(a·b) = log a + log b, so a linear combination in log-space becomes a product of powers:

exp(w1·log x1 + w2·log x2 + …) = x1^w1 · x2^w2 · …
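A quick numerical illustration of this identity (again my own toy example): fitting a plain linear model in log-space recovers the exponents of a hidden power-law relation as ordinary linear weights.

```python
import numpy as np

rng = np.random.default_rng(42)
x1 = rng.uniform(0.1, 10.0, 1000)
x2 = rng.uniform(0.1, 10.0, 1000)
y = (x1 ** 2.0) * (x2 ** 0.5)  # hidden multiplicative relation

# In log-space the relation is exactly linear:
# log y = 2.0 * log(x1) + 0.5 * log(x2)
A = np.column_stack([np.log(x1), np.log(x2)])
w, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print(w)  # ~ [2.0, 0.5]: the exponents fall out as linear weights
```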

After feeding the logarithms of the input features (as auxiliaries, or by just using them alone), we can use plain linear algebra or even more complex tools built on it, such as MLPs. It is a well-known tactic among expert data scientists to sometimes use the logarithms of features. I have personally coded layers in PyTorch and Keras that I reuse from time to time; I call them “multiplicative layers”. Of course, I sometimes force the bias of the layer to zero, or apply other constraints, depending on the task and dataset. One also has to treat zeros, extremely small numbers and negative values properly. Depending on the form of the dataset and task, I sometimes use log(1 + |x|) and reflect the sign as an auxiliary output. I can then cascade these layers, sometimes along with “normal” dense layers. Also note that, beyond taking data into a multiplicative domain, the logarithm and the exponential are themselves very much inherent in nature, so the odds are good that you will find natural logarithmic relations in your dataset. Life gave you 2.71… via Euler for a reason :) .
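The author’s actual layers are not shown in the post, so below is only a minimal PyTorch sketch of what such a “multiplicative layer” might look like, using the log(1 + |x|) treatment and returning the sign as an auxiliary output. The class name and arguments are my own assumptions, and the gradient-stability caveats in the next paragraph apply.

```python
import torch
import torch.nn as nn

class MultiplicativeLayer(nn.Module):
    """Hypothetical sketch of a 'multiplicative layer': a linear map in
    log-space, so each output behaves like a product of the inputs
    raised to learned powers: exp(sum_i w_i * log(1 + |x_i|))."""

    def __init__(self, in_features, out_features, use_bias=False):
        super().__init__()
        # Forcing the bias to zero keeps each output a pure product of powers.
        self.linear = nn.Linear(in_features, out_features, bias=use_bias)

    def forward(self, x):
        # log1p(|x|) safely handles zeros and negatives; the sign is
        # returned separately as an auxiliary output, as described above.
        sign = torch.sign(x)
        log_x = torch.log1p(torch.abs(x))
        return torch.exp(self.linear(log_x)), sign

# Hypothetical usage: cascade with 'normal' dense layers downstream.
layer = MultiplicativeLayer(5, 3)
out, sign = layer(torch.randn(8, 5))
```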

But watch out: backpropagation might give us some problems! The derivative of the exponential is still an exponential, and the derivative of the natural logarithm is 1/x. So one either has to master training these kinds of networks through experience, avoiding exploding/vanishing gradients (or ineffective learning), or not consider them at all and stick with conventional techniques.

Beware! These kinds of layers (especially if you want to cascade several) are prone to malfunction (vanishing/exploding gradients) or underperformance on most tabular tasks. However, for certain niche datasets and projects, they may help you a lot.
