Machine Learning Zuihitsu — I : Spectral Attention for Time Series

Eren Ünlü
Published in The Startup · 9 min read · Dec 12, 2020


Dr. Eren Unlu, Data Scientist and Machine Learning Engineer @Datategy, Paris, FR.

Thanks to the increasing popularity of data science and machine learning, and to an open-source mentality encouraged by the big players of Silicon Valley, we are experiencing a paradigm shift in how technical know-how propagates. A few years ago, the secrets of state-of-the-art artificial intelligence were confined to scientific articles, and free-of-charge access to most of those papers was out of the question. Perhaps it was not quite a concern yet, or the field was not "hot enough". I won't dive into the debate on scientific commercialization here; but long story short, half a decade ago it was almost impossible for enthusiasts to acquire the knowledge of AI's pioneers.

Then came GitHub, where people share the repositories of their code; the perfect galvanizer for the explosion of machine learning know-how. It still remains the primary mecca of humanity's collective knowledge in silicon, including data science and AI. However, most of the time publishers still avoid presenting their work verbally in a structured and abstracted way, beyond perhaps a brief introduction.

Today, we see another, relatively recent paradigm shift, where microblogging thrives. Many of the revolutionary ideas in machine learning circulate on microblogging websites in a relatively compact fashion with less technical detail, touching only the most significant parts of the ideas and decorated with the most crucial code snippets. So I couldn't resist the trend, and decided to share my experiences, the technical challenges I encounter and some ideas in my mind via a series of articles, which I have named Machine Learning Zuihitsu.

When we mention the literary genre of the essay, the first name that comes to mind is the French author Montaigne. I love the simplicity, comfort, open-endedness and self-narrative style of essays compared to the elegant tone of the literature of their era: just sharing ideas in a direct manner on a multitude of subjects. We can even consider them the first examples of blogging. However, our association of Montaigne with the birth of this genre is rooted in nothing but Eurocentrism. This particular style was already very popular among Japanese audiences well before the Edo period, under the name "zuihitsu". So, I decided to name this series after that legacy.

I cannot find much time to study my ideas extensively before sharing them, nor to inspect the whole literature. Please regard this series of articles as essays, as the name suggests. I would like to spark some intellectual brainstorming among readers and improve myself further through corrective and additive feedback.

Attention is all we need in life at the end of the day, even from the deep learning perspective. The seminal paper by Google engineers, "Attention Is All You Need" [1], published at the end of 2017, marked a major milestone in the development of cutting-edge NLP algorithms with the introduction of the transformer, based solely on self-attention. As people grasped the ingenuity and potential behind the proposed idea, hundreds of transformer-based NLP architectures flourished around the globe in less than a year. Eventually, these efforts gave birth to GPT-3, for instance, which even made it into mainstream media [2]; for the first time, people outside the AI community were astonished and convinced by the capabilities and very tangible potential of sci-fi-like thinking machines.

The notion of attention in machine learning predates the transformer and had been studied thoroughly. Ironically, the transformer proposed in this paper uses the most basic form of attention, dot-product attention. But simplicity crafts perfection. With the introduction of self-attention, the transformer converts everything into nothing but several matrix multiplications, eschewing any form of recurrence. In addition, by using several attention mechanisms in parallel (i.e. multiple attention heads), we are able to grasp and utilize independent latent relationships of a language in one shot, which provides different perspectives. Long story short, the power of transformers comes from using nothing but attention, which allows us to harness parallelization at full scale, as the idea manifests itself in the paper's title: Attention Is All You Need.

Three notions are at the heart of this philosophy: Query, Key and Value. The scaled dot-product attention used in [1] for the proposed transformer is Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. I personally had a hard time interpreting this concept of self-attention at first. The query-key product gives a sequence-length × sequence-length representation, providing a word-to-word comparison within the given sentence. We transform a vector of information into a query vector and try to find the most similar keys in the system (the dot product of the Q and K matrices acts as a similarity measure). The resulting attention weights are then used to take a weighted average of the corresponding values. I suppose we can see this mechanism as a "continuous form of a database", where we fetch the necessary information for a given signal, but in a relaxed fashion (a weighted average), since we apply a softmax. Q, K and V are produced by three projection matrices whose values are learned during training.
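To make the mechanics concrete, here is a minimal NumPy sketch of scaled dot-product attention as defined in [1] (the random Q, K, V here stand in for learned projections of the input):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in [1]."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                     # weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 positions, key dimension d_k = 8
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))   # value dimension d_v = 4
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (5, 4): one context vector per position
```

Each row of `weights` sums to one, which is exactly the "relaxed database lookup" reading: every query fetches a softmax-weighted mixture of all values.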

After the success of transformers in NLP, academics and engineers around the globe were eager to apply this marvelous idea to other fields of machine learning, such as computer vision or time-series forecasting. For example, [3] proposes a purely transformer-based image classification architecture, and [4] introduces a self-attention transformer network for time-series forecasting and classification.

When it comes to time series, it is natural to give transformers a shot, as NLP tasks are almost identical to any kind of temporal/sequential machine learning challenge. We can even claim that NLP is nothing but a subdiscipline of time-series studies, with its own very unique constraints. Hence, transformers will no doubt thrive in humanity's quest for the perfect AI forecaster in the coming years.

In this article, I would like to share an idea of mine on self-attention for time-series applications such as transformer-based forecasting, classification or segmentation. As I stated in the introduction to this series of blog articles: (1) I don't have much time due to my daily professional and personal activities, so I don't try to formulate a structured mathematical, theoretical and experimental baseline for my ideas. (2) I cannot inspect the whole literature, thus I do not claim my ideas are 100% original. (3) I do not claim these ideas are noteworthy. Keep in mind that I am sharing certain thoughts of mine with the hope of sparking an intellectual brainstorming among readers, while I may still be making erroneous assumptions and/or formulations in the process.

Spectral Self-Attention

So, I call this idea of mine Spectral Self-Attention for time-series applications. It is actually quite straightforward: I propose to use the Fourier transform of a time series as an auxiliary input to a temporal deep learning forecaster, classifier or segmenter. This spectral attention mechanism may be employed by a recurrent deep architecture or a transformer-like structure.

Let me explain the central idea as follows:

Say that we want to predict the next time step of a given time series X at time t; that is, we want to forecast X(t+1) at t. Let S(f) be the discrete Fourier transform of this signal (computed, say, via the Fast Fourier Transform). Can we compute attention scores over each frequency, and use a weighted average driven by these spectral attention scores at each time step to make a better prediction? In other words, at each time step t where we need to make a decision, on which frequencies should we focus more? Let A(t, f) be the attention score on each frequency, calculated from S(f) by an attention mechanism such as the scaled dot-product query-key-value concept. Can this spectral attention yield better forecasting?
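One way to write this down concretely (my own hypothetical formulation, not from any paper: here q_t is a learned projection of the recent observations, and k_f, v_f are learned projections of the FFT bin at frequency f):

```latex
A(t, f) = \operatorname{softmax}_f\!\left( \frac{q_t \cdot k_f}{\sqrt{d}} \right),
\qquad
\hat{X}(t+1) = g\!\left( \sum_f A(t, f)\, v_f \right)
```

where d is the projection dimension and g is a small feed-forward forecasting head. The softmax runs over frequencies, so A(t, ·) is a proper distribution telling us, at time t, which parts of the spectrum to attend to.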

Let me visualize this concept for you. Imagine a perfect discrete sinusoidal input signal X(t) of length T, and take its N-point Fast Fourier Transform. For a continuous, infinite, perfect sinusoid, the continuous Fourier transform would be a single Dirac impulse at the frequency of the sinusoid, f0. Note that in a discrete and finite world we cannot have this ideal form; there is residual, near-zero power on the other frequency components. But for the sake of simplicity, assume that the FFT of a sinusoid gives a single non-zero component.

OK, when we try to predict X(t+1) at time t, we will simply focus (via self-attention) on frequency f0, since our signal has only a single frequency. So, for this simple example, whatever t or X(t) may be, the system will always focus on f0.

Let us consider a second example, with an input signal X(t) that is a superposition of two sinusoids with different frequencies f0 and f1 and different amplitudes, as in the figure. At time t, while trying to predict X(t+1), how much attention should we assign to each frequency component? And after calculating these spectral attention scores, how can we incorporate this information to make a better prediction?
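The spectrum of such a superposition is easy to inspect numerically; a small NumPy sketch (my own illustrative choice of frequencies and amplitudes) shows the two dominant bins a spectral attention mechanism would have to weigh against each other:

```python
import numpy as np

fs, T = 1024.0, 1024               # sampling rate (Hz) and number of samples
t = np.arange(T) / fs
f0, f1 = 50.0, 120.0               # the two component frequencies
x = 1.0 * np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * f1 * t)

freqs = np.fft.rfftfreq(T, d=1 / fs)   # frequency axis, 1 Hz bin spacing here
mag = np.abs(np.fft.rfft(x))           # magnitude spectrum |S(f)|

# The two dominant bins sit exactly at f0 and f1, with magnitudes
# proportional to the component amplitudes (1.0 vs 0.5).
top2 = np.sort(freqs[np.argsort(mag)[-2:]])
print(top2)   # [ 50. 120.]
```

Because the magnitudes differ (here by a factor of two), a sensible attention distribution over frequencies should not be uniform; how exactly it should depend on t is the open question posed above.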

Note that, at the end of the day, each spectral value is a weighted sum of all time points' values (the very definition of a Fourier transform). Thus, this idea may be flawed or redundant, since we may not even need to consider the spectral density explicitly. However, I believe computing self-attention over an auxiliary Fourier transform can boost performance and lower the required computation.

Let me further solidify my intuition with an example. As is known, the usual approach in deep learning based time-series forecasting is to supervise the time series in shifted chunks, converting it to tabular form (also known as lagging). Let us consider that at time t of a univariate time series (either for the learning phase or for prediction), we have the L previous points as the look-back window. Without loss of generality, let us treat the rest of the series before the look-back window (i.e. up to t-L) as the long-term history, and take the N-point FFT of this long-term history. Note that this operation always yields a result of length N. Finally, let us feed the instantaneous value at t-L as a scalar into my suggested spectral attention mechanism. As you can see, this system may help us capture both the periodic effects of the long-term history and the short-term dynamics, by stitching these concepts together via an attention-based system.
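As a very rough sketch of what this could look like, assume the query comes from the look-back window and the keys/values come from the FFT bins of the long-term history plus the scalar value at t-L. Everything here is hypothetical: the function name, the feature construction, and the random matrices standing in for learned projections.

```python
import numpy as np

def spectral_attention_context(x, t, L=32, N=64, d=16, seed=0):
    """Hypothetical sketch of the suggested spectral attention.
    Query: projection of the look-back window x[t-L:t].
    Keys/values: per-frequency projections of the N-point FFT magnitude
    of the long-term history x[:t-L], plus x[t-L] as a scalar feature."""
    rng = np.random.default_rng(seed)
    lookback = x[t - L:t]                          # short-term dynamics
    history = x[:t - L]                            # long-term history
    S = np.abs(np.fft.rfft(history, n=N))          # fixed-length spectrum: N//2 + 1 bins
    feats = np.stack([S, np.full_like(S, x[t - L])], axis=1)   # (F, 2)

    # In a real model these projections would be learned; here they are random.
    Wq = rng.normal(size=(L, d))
    Wk = rng.normal(size=(2, d))
    Wv = rng.normal(size=(2, d))

    q = lookback @ Wq                              # (d,) query
    K = feats @ Wk                                 # (F, d): one key per frequency
    V = feats @ Wv                                 # (F, d): one value per frequency

    logits = K @ q / np.sqrt(d)                    # spectral attention scores A(t, f)
    a = np.exp(logits - logits.max())
    a /= a.sum()                                   # softmax over frequencies
    return a @ V, a                                # context vector for a forecasting head

x = np.sin(2 * np.pi * 0.01 * np.arange(500))
ctx, attn = spectral_attention_context(x, t=480)
print(ctx.shape, attn.shape)   # (16,) (33,)
```

The returned context vector would then be concatenated with the look-back window and fed to whatever forecaster sits on top; the attention vector itself is directly inspectable, which is part of the interpretability appeal.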

As I just mentioned, the frequency domain and the temporal domain are fully and bidirectionally interchangeable: having complete information in one, we can fully reconstruct the other. Thus, at first glance it may seem unnecessary to develop a mechanism based on the FFT of a signal. However, I suspect various interesting ideas can be cultivated around attention over the frequency domain. For instance, a fixed FFT length can be used to train on numerous time series of different, arbitrary lengths; the length of each individual training series can then be incorporated as a scalar into this hypothetical spectral attention mechanism. Also note that, even for a single signal source, the length of the series grows over time as we gather data and/or predict the future. Hence, we recalculate a fixed-size FFT of a signal of varying length, but we should still incorporate the signal's length (or, equivalently, the frequency spacing between FFT points) each time.

In the end, we may be able to construct learnable matrices that transform the question of predicting the series at time t+1 into a query, and extract spectral attention from key-value pairs; these values can then be fed to a neural network for forecasting, just as in transformers. I do not yet know how to formulate, at a given time step t0, which spectral parts of the series' instantaneous FFT (i.e. of X(t < t0)) to attend to, nor exactly how to use this weighted spectral attention for prediction. But I still suspect that such an approach in the spectral domain may be more efficient, require less computational effort and yield more interpretable results for humans. I would like to get feedback from the audience on this issue, and on whether it is a meaningful suggestion at all.
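The fixed-length-spectrum idea is easy to prototype. One simple choice (my own, not from any reference) is to resample each series' magnitude spectrum onto a fixed grid of N bins and append the original length as a scalar, so series of arbitrary length map to the same feature size:

```python
import numpy as np

def fixed_size_spectrum(x, N=64):
    """Summarize a series of arbitrary length into N spectral bins plus
    its length as a scalar (one simple, hypothetical design choice)."""
    S = np.abs(np.fft.rfft(x))                      # length depends on len(x)
    grid = np.linspace(0.0, 1.0, N)                 # normalized frequency grid
    S_fixed = np.interp(grid, np.linspace(0.0, 1.0, S.size), S)
    return np.concatenate([S_fixed, [len(x)]])     # shape (N + 1,)

for length in (100, 500, 2000):
    x = np.sin(2 * np.pi * 0.05 * np.arange(length))
    print(fixed_size_spectrum(x).shape)   # always (65,)
```

Carrying the length (or equivalently the bin spacing) alongside the resampled spectrum is what lets the downstream attention mechanism know how to interpret the normalized frequency axis for each individual series.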

Cheers,

Eren.

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30, 5998–6008.

[2] https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3

[3] https://neurohive.io/en/news/vision-transformers-transformers-work-well-in-computer-vision-too/

[4] https://www.topbots.com/attention-for-time-series-forecasting-and-classification/

