# Time series batch processing for outlier detection

Our Department of Advanced Analytics at RBC Group often deals with time series forecasting. These are mainly daily retail sales with a standard set of issues: missing values, outliers, and different lengths of history available for analysis. In the worst cases, customers have no external data sources, such as records of their promo campaigns, which makes data cleaning even more complicated.

Usually, data cleaning includes outlier detection. Although quite a few out-of-the-box algorithms for outlier detection exist, running them on each individual series can feel like a waste of computational resources, especially when you need to detect outliers in thousands of time series sharing common dynamic (seasonal) patterns.

With this in mind, we started thinking about batch processing of time series to detect outliers as automatically and intelligently as possible.

Below we explain our approach to extracting a unified seasonality from a bulk of univariate time series. What you will probably appreciate most about this approach is the ability to adjust for seasonality (and thus detect outliers) even in time series with less than two full ‘long-term’ (e.g. yearly) seasons.

## Generating data

To illustrate our approach, we will first generate a toy dataset with all the heavy legacy described above: uneven availability, gaps, and fundamental shifts in the data.

Here is how artificially generated time series may look when they share common (but unevenly distributed) fluctuations of different frequencies, intended to emulate the yearly and weekly seasonal patterns in retail:
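A minimal sketch of such a generator (the names, sizes, and amplitudes below are illustrative assumptions, not the article's actual code): each series mixes the same yearly and weekly sinusoids, scaled by per-series random amplitudes, plus noise.

```python
import numpy as np

rng = np.random.default_rng(42)

N_SERIES, N_DAYS = 1000, 3 * 365  # three years of daily data
t = np.arange(N_DAYS)

yearly = np.sin(2 * np.pi * t / 365)  # 'long-term' seasonal pattern
weekly = np.sin(2 * np.pi * t / 7)    # short seasonal pattern

# Per-series levels and amplitudes make the shared fluctuations
# unevenly distributed across the dataset
levels = rng.uniform(50, 500, size=(N_SERIES, 1))
amp_y = rng.uniform(0.1, 0.3, size=(N_SERIES, 1))
amp_w = rng.uniform(0.05, 0.15, size=(N_SERIES, 1))
noise = rng.normal(0, 0.05, size=(N_SERIES, N_DAYS))

data = levels * (1 + amp_y * yearly + amp_w * weekly + noise)
```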

To make those time series more ‘realistic’ we will exclude data from some of them according to this scheme:
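One way to emulate uneven availability is to left-censor each series at a random start day; this is only a sketch of such a scheme, with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the generated dataset: 20 series, one year of daily data
data = rng.normal(100, 10, size=(20, 365))

# Each series only has history from a random start day onwards;
# everything before that becomes NaN
starts = rng.integers(0, 300, size=data.shape[0])
for i, s in enumerate(starts):
    data[i, :s] = np.nan
```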

… further mess them up with a huge gap and some fundamental shift:
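Sketched on a single series, the gap and the fundamental shift could be injected like this (positions and magnitudes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
series = rng.normal(100, 10, size=365)

series[150:210] = np.nan  # a 60-day gap of missing values
series[210:] *= 1.5       # a step change in level after the gap
```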

This is how we’ve got an (almost) real-world dataset to experiment with:

## Computing time series barycenter (batch processing)

The most discerning readers have probably already guessed that we need something like a barycenter of all the time series to extract the common seasonality. Although there is a great Python module (tslearn) designed especially for this purpose, we can’t use it in our case because of the NaNs in the data. Instead, we can simply take *np.nanmean(data, axis=0)* over all the stacked time series, smoothing each of them slightly beforehand. This is the core of our batch processing.
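A sketch of this step, assuming the smoothing is a short NaN-aware rolling mean (the window length and data are illustrative; the article does not specify the exact filter):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
data = rng.normal(100, 10, size=(50, 365))
data[rng.random(data.shape) < 0.2] = np.nan  # sprinkle in missing values

# Slightly smooth each series with a centered rolling mean that
# tolerates NaNs (min_periods=1), then average across series,
# ignoring NaNs, to get the barycenter
smoothed = (
    pd.DataFrame(data.T)  # rows = days, columns = series
    .rolling(window=7, min_periods=1, center=True)
    .mean()
    .to_numpy()
    .T
)
barycenter = np.nanmean(smoothed, axis=0)
```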

Here is how a random sample will look after smoothing:

… and this is our barycenter:

## Extracting common seasonal patterns from the barycenter

Next, we approximate the seasonality (both yearly and weekly) with Fourier terms. Now watch my hands:
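One way to fit such Fourier terms is an ordinary least-squares regression of the barycenter on sine/cosine columns for each period. The barycenter below is synthetic and the harmonic counts are assumptions, so treat this as a sketch of the technique rather than the article's exact code:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(2 * 365)
# Synthetic barycenter with yearly and weekly components plus noise
barycenter = (
    100
    + 20 * np.sin(2 * np.pi * t / 365)
    + 5 * np.sin(2 * np.pi * t / 7)
    + rng.normal(0, 1, t.size)
)

def fourier_design(t, period, n_harmonics):
    """Sine/cosine columns for the given period."""
    cols = []
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Design matrix: intercept + yearly and weekly harmonics
X = np.column_stack([
    np.ones(t.size),
    fourier_design(t, 365, 3),
    fourier_design(t, 7, 2),
])
coef, *_ = np.linalg.lstsq(X, barycenter, rcond=None)
seasonality = X @ coef  # smooth seasonal approximation of the barycenter
```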

Seasonal patterns in generated data for all the years:

## Detecting outliers after removing (common) seasonality and trend

Next, we express the seasonality as a share of the target value (i.e. of each univariate time series) and apply an advanced filter for data smoothing.

This final step is rather trivial. For each time series in a loop we:

- adjust for the (common) seasonality;

- remove the linear (yearly) trend from the adjusted data;

- detect outliers as residuals above a certain dynamic threshold.
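The steps above can be sketched for a single series as follows. Everything here is an assumption-laden illustration: the series is synthetic, and the simple z-score cutoff stands in for the article's dynamic threshold (a real implementation might use a rolling MAD or quantile instead):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(365)

# Common seasonality expressed as a share of the target value
seasonality_share = 0.2 * np.sin(2 * np.pi * t / 365)

# Synthetic series: level * (1 + seasonal share) + trend + noise
series = 100 * (1 + seasonality_share) + 0.05 * t + rng.normal(0, 2, t.size)
series[[50, 200]] += 40  # inject two outliers

# 1) adjust for the (common) seasonality
adjusted = series / (1 + seasonality_share)

# 2) remove the linear trend fitted on the adjusted data
coef = np.polyfit(t, adjusted, deg=1)
residuals = adjusted - np.polyval(coef, t)

# 3) flag residuals above a threshold (plain z-score here)
z = (residuals - np.mean(residuals)) / np.std(residuals)
outliers = np.flatnonzero(np.abs(z) > 3)
```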

Here is the output of the algorithm for some random time series:

In other words, these abnormalities in the residuals cannot be properly explained by seasonality and trend. In real-world cases, you may think of them as extra-large sales “due to special events” such as holidays, promotional campaigns, etc.