Tips and Tricks
Data has many periodic components you need to visualize? Treat it like audio!
Visualize periodic components using the Fourier transform.
You’re working with timeseries data, and you’re at the initial discovery stage. You’re trying to figure out all the different ways that the data varies in time. There may be trends, whereby the data is constantly increasing or decreasing in time. There may be completely random variations. And there may be cyclic components, that vary up and down in time with some fixed periods.
If there is only one periodic component, that may be easy to see in the data, and you could even get a visual estimate of its period. But what if there are multiple periodic components, each with its own period? A simple plot is not typically useful if the data has more than 3 (sometimes more than 2) periodic components.
Is there some way to make sure you’ve identified all periodic components, each with a different period, along with their own trends and variations in time, and visualize them? That thing actually exists, it’s called a spectrogram, and it’s a well-understood topic in signal processing. Since you’re working with timeseries data, many techniques used in signal processing should apply, since signals (in audio, astrophysics, etc) are essentially timeseries data, where each observation has a timestamp and a value.
Let’s find out how to make a spectrogram.
The Fourier transform
The mathematical basis for this is called the Fourier transform. In a nutshell, if your signal has periodic components, FT is the mathematical process used to extract their frequencies (or periods) from the signal.
Applying FT once, to a time interval in your signal (a window, or a bucket of time), gives you the periodic components within that window. If you move the window to a different place in your signal and apply FT again, you get the periodic components in that window. Repeating the process many times, each time moving the window only a little bit, will paint a picture of how the periodic components evolve in time (if they do).
Lessons from audio
In audio processing, this technique is used very frequently. Capturing a small time window of audio and applying FT to it will reveal the “spectrum” of the signal, or the set of frequency components of audio, at that time.
This is a project I’ve started a while ago with the goal of building the spectrogram (the evolution of the periodic components in time) of any audio file:
If you take the song Why So Serious (the Joker theme) from the movie The Dark Knight and apply soundspec to it, this is the spectrogram you get:
On the horizontal axis, there’s time. On the vertical axis, there’s frequency. Please note, period and frequency are inversely proportional, so period = 1 / frequency. The color denotes amplitude, with yellow showing high amplitude, and blue showing low or zero amplitude.
You can see how between 200 and 400 seconds there are very strong periodic components with a frequency around 30 … 40 Hz, so a period of 0.033 … 0.025 sec (and you can hear that in the song as a powerful bass line). There are many other periodic components scattered throughout the spectrogram — music is typically pretty complex this way.
Back to data
Let’s take a look at this dataset:
https://github.com/FlorinAndrei/misc/blob/master/fft_data.p?raw=true
Initialize all libraries:
import pandas as pd
from matplotlib import pyplot as plt
from scipy import signal
import numpy as np
import os
import pickleplt.rcParams["figure.figsize"] = (16, 9)
Plot the dataset:
if os.path.isfile('fft_data.p'):
dt = pickle.load(open('fft_data.p', 'rb'))
plt.plot(dt);
The first part is somewhat obvious, there’s a periodic component in there, and some noise. But the rest gets pretty complicated. Are there multiple periodic components in there? If so, what are their periods? And how many of them are there? And are they constant in time, in terms of the contribution to the signal, or do they vary?
The SciPy library has a function called signal.spectrogram()
which can generate the spectrogram of this dataset / signal. Let’s apply it:
f, t, Sxx = signal.spectrogram(dt, 1)
plt.pcolormesh(t, f, np.log10(Sxx), shading='auto');
As you can see, the Fourier transform is applied 4 times to the whole dataset, generating 4 sets of spectrum. It’s pretty clear from where the yellow color is that all the action happens at the bottom, where the low frequencies are (where the short periods are).
Let’s keep only the bottom 1/5 of the spectrogram and zoom into it, and the rest can be ignored:
f, t, Sxx = signal.spectrogram(dt, 1)
frac = len(f) // 5
fshort = f[0:frac]
Sshort = Sxx[0:frac]
per = np.rint(1 / fshort).astype(int)
plt.pcolormesh(t, fshort, np.log10(Sshort), shading='auto');
That’s better. You can already start to suspect there are perhaps 3 periodic components, one (with the highest frequency) is important in the center of the interval, the other (with an intermediate frequency) in the second half of the interval, and the third (with a low frequency) seems about the same everywhere.
We need to increase the resolution of the image, in terms of both time and frequency. To do that, we will define a fixed size for the time window via the nperseg
parameter of signal.spectrogram()
. We will also maximize the overlap between successive windows, via the noverlap
parameter: if noverlap = nperseg-1
then the first window is applied to the left-most part of the signal, it is then moved right by 1 pixel, applied again, moved again by 1 pixel, and so on.
nseg = 200
f, t, Sxx = signal.spectrogram(dt, 1, nperseg=nseg, noverlap=nseg-1)
frac = len(f) // 5
fshort = f[0:frac]
Sshort = Sxx[0:frac]
per = np.rint(1 / fshort).astype(int)
plt.pcolormesh(t, fshort, np.log10(Sshort), shading='auto');
Much better. Clearly, there are 3 periodic components. The one with the lowest frequency is constant throughout the whole dataset, and is probably the one we saw in the original plot — the up-and-down waves in the left third of the image. The other two components vary in time — that explains why the plot differs in the last two thirds.
Let’s zoom in again by using an even bigger window (400 points):
nseg = 400
f, t, Sxx = signal.spectrogram(dt, 1, nperseg=nseg, noverlap=nseg-1)
frac = len(f) // 5
fshort = f[0:frac]
Sshort = Sxx[0:frac]
per = np.rint(1 / fshort).astype(int)
plt.pcolormesh(t, fshort, np.log10(Sshort), shading='auto');
This is probably as good as it gets. Keep in mind, there are some tradeoffs here: you want a big window in order to increase resolution, and also to properly detect long-period signals. But by increasing the window you lose data points at the left and right extremes of the image — a big window cannot get “close enough” to the edge to analyze it. See how the horizontal axis gets shorter as the window gets bigger.
On the vertical axis, let’s convert frequency to period. Let’s add a grid to the image. And finally, let’s smooth the image to make it look better (via the shading
parameter of plt.pcolormesh()
).
nseg = 400
f, t, Sxx = signal.spectrogram(dt, 1, nperseg=nseg, noverlap=nseg-1)
frac = len(f) // 5
fshort = f[0:frac]
Sshort = Sxx[0:frac]
per = np.rint(1 / fshort).astype(int)plt.yticks(ticks=fshort[::2], labels=per[::2])
plt.ylabel('period')
plt.xlabel('obs')
plt.pcolormesh(t, fshort, np.log10(Sshort), shading='gouraud')
plt.grid(True)
plt.show()
It’s very clear now: there is one component with a period of 100 points, which is constant throughout the dataset. There’s another component with a period of 50 points, which only matters in the second half. There’s a third component with a period of 25 points, which is strong in the middle of the dataset but weak at the extremes.
The dataset is actually synthetic. This is the code used to generate it, where you can see the components, their frequencies, and their variable amplitudes:
N = 1000
dt = np.zeros(N)
np.random.seed(0)for t in range(N):
dt[t] += 50 * np.random.randn()
a1 = 429 * np.exp(-((t - N / 2) / (N / 10)) ** 2)
p1 = 25
s1 = p1 / 5
dt[t] += a1 * np.sin(2 * np.pi * t / p1 + s1)
a2 = 212 * (np.tanh(5 * (2 * t / N - 1)) + 1)
p2 = 50
s2 = p2 / 7
dt[t] += a2 * np.sin(2 * np.pi * t / p2 + s2 + np.random.randn() / 2)
a3 = 375
p3 = 100
s3 = p3 / 4
dt[t] += a3 * np.sin(2 * np.pi * t / p3 + s3 + np.random.randn() / 10)pickle.dump(dt, open('fft_data.p', 'wb'))
The component with the shortest period has an amplitude that varies in time by a gaussian curve. The intermediate component’s amplitude varies in time as a hyperbolic tangent.
Final thoughts
The Fourier transform works well if data doesn’t have gaps, and the time dimension is uniform. Otherwise, you will have to impute missing data points, fill in the gaps, make sure the time dimension is well behaved.
The notebook with all the code used in this article is here:
https://github.com/FlorinAndrei/misc/blob/master/fft.ipynb
Thanks for reading!