Time series: what tools are available in Python to analyse them?

Second part: Data filtering, analysis of the components and prediction

Stéphanie Crêteur
Geek Culture
9 min read · Dec 10, 2022


Here is the second and last part of my article about time series. In this part, I will dive into data filtering, the analysis of the components and prediction. You’ll find the first part which focuses on visualisation and statistics by following this link:

As a reminder, I was interested in analysing time series (which can be defined as the chronological evolution of a quantity, measured at regularly spaced points in time). The most common time series you can think of would be the stock market. After the visualisation and statistical analysis of my data (which was left intentionally abstract), I wanted to filter it since it seemed to be noisy.

Photo by Nick Chong on Unsplash

Data Filtering

A filter ‘should eliminate all the undesired components’ (Haslwanter, 2021, p. 72). Indeed, if the aim of the analysis is to create a model, ‘[t]he presence of noise in a data set can increase the model complexity and time of learning which degrades the performance of learning algorithms’ (Gupta & Gupta, 2019, p. 466). Noisy data can refer to ‘errors (incorrect examples), as well as outliers (correct examples representing some relatively rare subconcept of the target theory)’ (Gamberger et al., 2000, p. 206). Being able to handle such data is crucial if you want to recognise and analyse the patterns present. It is, however, important to understand that filtering may reduce the precision of the data; in some contexts, though, you might be ‘more concerned with establishing whether the data exhibits a meaningful relationship, rather than establishing its precise character’ (Janert, 2011, p. 48). Certainly, ‘A smooth curve such as a spline or LOESS approximation is only a rough approximation to the data set — and, by the way, contains a huge degree of arbitrariness in the form of the smoothing parameter (α or h respectively)’ (Janert, 2011, p. 57). You should therefore pay attention to the context and make the best judgment accordingly. Here, I will use two smoothing algorithms: one that I will implement myself, and one available in the SciPy library. I will compare the results and see how a filter can drastically change the function and its components.

To choose the best fit for the filter, it is important to consider some criteria. The first is to make sure that the data has been sampled at equal intervals, which is true for these values. In such a case, ‘FIR- or IIR-filters can be used’ (Haslwanter, 2021, p. 90); otherwise, I would need to use more complex smoothing algorithms, such as LOESS or splines, which can handle irregularly sampled data.

An Infinite Impulse Response (IIR) filter makes more sense in a dynamic system, so I will be using a Finite Impulse Response (FIR) filter instead. The FIR filter has no feedback in its equation, which makes it very stable. First, I will use a weighted moving average, which is a special case of the FIR filter and, as the name suggests, also has a finite impulse response. This method weights each data point according to its distance from the point being estimated and is very good at reducing random white noise.

Here is the function weighted_moving_average() that was used.

import numpy as np

def weighted_moving_average(x, y, step_size=0.05, width=1):
    # Centres of the bins over which the weighted averages are computed
    bin_centers = np.arange(np.min(x), np.max(x) - 0.5 * step_size, step_size) + 0.5 * step_size
    bin_avg = np.zeros(len(bin_centers))

    # Gaussian kernel used to weight the points around each bin centre
    def gaussian(x, amp=1, mean=0, sigma=1):
        return amp * np.exp(-(x - mean) ** 2 / (2 * sigma ** 2))

    for index in range(0, len(bin_centers)):
        bin_center = bin_centers[index]
        weights = gaussian(x, mean=bin_center, sigma=width)
        bin_avg[index] = np.average(y, weights=weights)

    return (bin_centers, bin_avg)
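
For illustration, here is how the function might be called and plotted on a purely synthetic noisy signal (the data below is made up for the example and is not the series analysed in this article):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic noisy signal, for illustration only
x = np.linspace(0, 10, 500)
y = np.sin(x) + np.random.normal(scale=0.3, size=len(x))

bin_centers, bin_avg = weighted_moving_average(x, y, step_size=0.05, width=0.5)

plt.plot(x, y, alpha=0.4, label='raw')
plt.plot(bin_centers, bin_avg, label='weighted moving average')
plt.legend()
plt.show()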

In the visualisation below, we can compare the untouched function with the result of the weighted moving average algorithm.

The result from a weighted moving average algorithm

Another way to apply a FIR filter is the lfilter function from scipy.signal, whose result is shown in the graph below.

The result of the scipy.signal lfilter
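
The article does not show the exact call, but a minimal sketch of an equal-weight FIR moving average applied with scipy.signal.lfilter could look like this (the window length N, the raw array y and the name smoothed_function are assumptions):

import numpy as np
import pandas as pd
from scipy import signal

# Equal-weight FIR moving average of length N; N is an arbitrary choice here
N = 20
b = np.ones(N) / N   # numerator (FIR) coefficients
a = 1                # denominator of 1: no feedback, so the filter stays FIR

# y is assumed to hold the raw, noisy values of the original function
smoothed_function = pd.Series(signal.lfilter(b, a, y))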

It is clear that the two algorithms give different answers, which also vary greatly depending on the input parameters. Which one is best, and which parameters to use, will depend largely on the context. These two results show that smoothing ‘reveals behaviour that would otherwise not be visible’ (Janert, 2011, p. 52) and confirm what the visualisation had suggested: there is a recurrence of rises followed by drops around the mean.

The summary statistics give some interesting insights into the lfilter results. Using the describe() function again, I obtain these results:

Compared to the original function, it is visible that the mean and median are even closer to zero and that the minimum and maximum have been drastically reduced. The range of the function is therefore considerably narrowed. As for the distribution, I get the following histogram, which seems closer to a normal distribution but slightly right skewed, with more of the data in the negative.

Histogram of the smoothed function
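
For reference, a minimal sketch of how these summary statistics and the histogram can be obtained (assuming smoothed_function is the pandas Series holding the filtered values, as in the sketch above):

import matplotlib.pyplot as plt

# Summary statistics of the filtered series
print(smoothed_function.describe())

# Histogram of the filtered values
smoothed_function.hist(bins=30)
plt.show()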

To know whether those results are consistent with what you want to achieve, you will once again need to check the objectives of the study and the model.

Analysis of the main components of the time series

Stationarity

An important part of the analysis is to know whether the data is stationary, that is, whether its ‘statistical properties do not change over time’ (Peixeiro, 2022, p. 39). This is the case if it has a constant mean and constant variance over time. To calculate and visualise this, I used this code:

import matplotlib.pyplot as plt

# Rolling mean and rolling standard deviation over a 50-point window
yy_average = smoothed_function.rolling(window=50).mean()
yy_std = smoothed_function.rolling(window=50).std()

plt.ylim([-0.5, 0.5])
yy_std.plot()
yy_average.plot()
plt.show()

Plotting this with matplotlib gives the result for the original function in the figure on the left and for the smoothed function on the right.

Neither the mean nor the variance changes too drastically, except in the smoothed function, where the mean drifts slightly further into negative values towards the end. We can therefore say that our data is stationary, which is important to know since many forecasting models assume that the statistical properties do not change over time. This makes sense: if the statistical properties change over time, so should the properties of the model. If the data is not stationary, ‘We cannot possibly derive a function of future values as a function of past values, since the coefficients change at each point in time’ (Peixeiro, 2022, p. 39). In that case, one possibility to stabilise the mean is differencing, while applying a logarithmic transform can stabilise the variance.
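
As a quick illustration, here is a minimal sketch of those two transformations with pandas and NumPy (smoothed_function is again assumed to be the pandas Series of filtered values; note that the logarithmic transform only applies to strictly positive series, which is not the case for this data):

import numpy as np

# First-order differencing stabilises the mean of a non-stationary series
differenced = smoothed_function.diff().dropna()

# A logarithmic transform can stabilise the variance, but only when every
# value is strictly positive (strictly_positive_series is a hypothetical example)
# log_transformed = np.log(strictly_positive_series)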

Autocorrelation and Seasonality

Another useful tool to understand a time series, and more specifically its seasonality, is autocorrelation, which is the correlation of a signal with itself over different time delays. It ‘can be used to detect periodicity in a signal which may be impossible to see otherwise’ (Haslwanter, 2021, p. 114).

The autocorrelation shows whether there are periodic events that impact the data, and the strength of the relation between the current values and the past ones. To do that, I will use the statsmodels library and specifically its plot_acf() function; the plot on the left shows the result for the original function, while the one on the right shows it for the filtered version.
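
For reference, the call itself is short; a minimal sketch (the variable name smoothed_function and the number of lags shown are assumptions):

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# ACF plot with a 95% confidence interval (alpha=0.05 is the default)
plot_acf(smoothed_function, lags=40)
plt.show()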

The cone in transparent blue gives the confidence interval (here set at 95%). Therefore, if a point is outside the blue zone, we can say with 95% probability that it has an impact on the current value, while if it is inside the blue cone we can consider it equal to zero (Peixeiro, 2022, p. 47). The point at lag 0 is 1, since it reflects how much the current value explains itself. For the original function, there seems to be no visible seasonality, as the coefficients drop below the confidence interval right at lag 1. For the smoothed function, it is a bit different: we see a correlation in the first values but, as we progress, the current values are explained less and less by the previous ones. We can also see a statistically significant dip into negative values, which tells us that in this function ‘if a past value is above average the newer value is more likely to be below average (or the other way round)’ (Drelczuk, 2020). This could indicate mean reversion, where values tend to revert towards previous levels after large moves. Some seasonality is visible as well, even though it is not significant enough to draw any conclusion; of course, without a time unit on the x-axis it is difficult to be definitive, but a cycle seems to be forming in which the data goes from positive to negative over roughly 15 seconds/days/months/years…

Seasonal Decompose

The statsmodels package has another very useful function, seasonal_decompose(), which lets you decompose the time series into its main components: trend, seasonality and residuals. Applying it to the smoothed data, I obtain the results visible in the figure below.
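
A minimal sketch of that call might look like this (since there is no real time index, the seasonal period has to be passed explicitly; period=15 is an assumption based on the roughly 15-unit cycle suggested by the ACF plot):

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Decompose the filtered series into trend, seasonal and residual components
decomposition = seasonal_decompose(smoothed_function, model='additive', period=15)
decomposition.plot()
plt.show()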

The ‘seasonal’ plot shows how our data deviates from the general trend: it has regular spikes going above the trend while most of the data stays below. As for the trend, which represents the ‘slow-moving changes in a time series’ (Peixeiro, 2022, p. 6), it seems to stay around the mean (a bit below zero for the smoothed function), even though the last 50 points show a tendency to decrease.

Prediction

These detailed descriptions of the function lead toward the final step, forecasting, i.e. ‘the prediction of the future using historical data and knowledge of future events that might affect our forecasts’ (Peixeiro, 2022, p. 6). In order to do so, we will need to take into account the features we have highlighted, especially seasonality, which is known to have a significant impact on time series.

One of the methods used to predict future results is a random walk. To be able to use a random walk, the data must be stationary, and I have shown previously, with tests such as the ACF plot and the running average, that the original function is purely random, with no visible trend or seasonality. Therefore, a random walk model would be possible in this context. However, those values seem too noisy for there to be much point in doing so.

Another option, suitable if the data is autocorrelated, is the moving average model. Indeed, in that context, ‘The present value is linearly dependent on current and past error terms’ (Peixeiro, 2022, p. 88). However, for this smoothed data, the autocorrelation coefficients fall too slowly to use this method, and I have repeatedly shown that the data seems to follow a sinusoidal pattern, which makes this model inappropriate. Therefore, taking these characteristics into account, an autoregressive model could be the best choice for predicting this data set. This process ‘establishes that the output variable depends linearly on its own previous values’ (Peixeiro, 2022, p. 91), which means that I will use lagged values to make the prediction.

In this case, I will use the SARIMA model, a more powerful extension of ARIMA (Autoregressive Integrated Moving Average) that also supports the seasonal component of the data. Ideally, to find the best parameters for the model, I would use all the elements gathered previously. It is, of course, quite difficult to set proper parameters without a real time frame on the x-axis, but I still managed to get a realistic result by using the pmdarima library, which helps to find the best parameters for ARIMA models. The result is visible below. It does seem to make sense, since the predictions beyond the 400th point regress toward the mean, which, as shown previously, has been a feature of the data.
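
For reference, here is a sketch of how pmdarima can be asked to search for suitable orders (the variable name smoothed_function, the seasonal period m and the forecast horizon are all assumptions):

import pmdarima as pm

# auto_arima searches over the (p, d, q)(P, D, Q, m) orders;
# m is the seasonal period and, without a real time axis, is only a guess here
model = pm.auto_arima(
    smoothed_function,
    seasonal=True,
    m=15,
    stepwise=True,
    suppress_warnings=True,
)

# Forecast the next 50 points beyond the observed data
forecast = model.predict(n_periods=50)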

And that’s it for time series. In these two articles, I have tried to show how to analyse such data with the help of some of the great libraries available in Python. Follow me for more!
