Why Are Time Series So Hard to Handle?

Vivien Leonard
Published in The Startup
Jan 23, 2021

When working on time series, you can find quite a few examples, tutorials, and other resources. But, from my perspective, most time series based projects are way harder than any other project I have done. Through this article, I'll try to explain why I think this is a hard subject.

Graphical representation of a time series I'm working on.

What are time series?

A time series is a set of measurements of some quantity taken at different points in time. Its main particularity is the time component. Each observation may or may not be linked to the following or previous observations. You can have dynamic relationships, and at different levels. For instance, you can find that an observation is linked to the five previous observations, but you can also find a really slow, almost silent trend that drives the values of the time series and is totally invisible at first glance.

Time series are complex objects that can exhibit some really subtle behaviours.
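To make that idea of observations being linked to previous ones a bit more concrete, here is a minimal sketch (the AR(1) toy series and the chosen lags are just assumptions for the example) that builds a simple autoregressive series and measures how strongly each value is correlated with the values a few steps before it:

```python
import numpy as np
import pandas as pd

# Toy series where each value depends on the previous one
# (an AR(1) process): x_t = 0.8 * x_{t-1} + noise
rng = np.random.default_rng(0)
values = [0.0]
for _ in range(499):
    values.append(0.8 * values[-1] + rng.normal())
series = pd.Series(values)

# Autocorrelation at different lags: strong at lag 1, then decaying.
# This is the kind of "link" between observations mentioned above.
for lag in (1, 2, 5, 12):
    print(f"lag {lag}: autocorrelation = {series.autocorr(lag=lag):.2f}")
```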

A not-so-well-defined problem

In my opinion, the theoretical cause that makes time series hard to handle is that they are not a well-defined problem. Let me explain.

Take a really well-known problem domain: classification.

As data scientists, when dealing with a classification problem, and even before starting to work on it, I believe almost everyone knows what the project will look like:

  • Data gathering
  • Data exploration and some statistics about it
  • Data cleaning
  • In some cases, some transformations (one-hot encoding, etc.)
  • Train the model
  • Evaluate the model on a test set, based on a huge collection of metrics available
  • If the results are bad, change the model and pick another one from the huge collection of classification models available
  • And boom, you’re done

Well, sorry for the caricature, but for me, that's essentially what a classification problem looks like. My point is: this is a well-known problem, with a well-known machine learning pipeline to achieve a decent solution.
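As an illustration of that well-known pipeline, here is a minimal scikit-learn sketch; the iris dataset and the choice of a random forest are just assumptions for the example, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Data gathering / exploration skipped here: iris is already clean
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Transformation + model, chained in one pipeline
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
model.fit(X_train, y_train)

# Evaluation with a couple of the many metrics available
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("macro F1:", f1_score(y_test, pred, average="macro"))
```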

Why is it a well-known problem? In my opinion, because our work is to solve problems, and formulating a classification problem is straightforward: we want to classify an object into a category. The validation process to check whether the solution is correct is also quite straightforward: is the object in the correct category? Yes or no. So when someone asks you to solve a classification problem, in most cases you'll easily understand what they are asking you to solve.

When dealing with time series, the first challenge is to formulate the problem. Do we want to forecast a really precise value? And for which step in the future? Are we even sure that we want to forecast a value? Maybe we just want to extract a trend instead.

For these different reasons, time series are, for me, a much harder problem because of what they imply. The range of acceptable solutions is very large, so how do you choose the best one? It requires a lot more work from the company when formulating the problem, and it is likely that the formulation will be really hard to approach, as will getting the data into the right form to solve the problem.

Discouraging toy examples

I don't know about you, but when I started working with time series, one of the first things I did was to look for resources on how to handle them. I found a lot of toy examples, like this one:

The famous air passengers time series example

I'll make a guess, but I'm pretty sure that almost everyone working with time series has already seen this example. And that's great: you get a first glance at some time series handling. I discovered the concepts of stationarity, the Dickey-Fuller test, and differencing. That was cool, but then I tried with my own data, and it didn't work. I repeated this process many times and came to the following conclusion: in my opinion, each time series problem is unique, and you won't find an off-the-shelf solution. Of course, in classification you may also have to deal with problems that are quite rare, but honestly, there is a good chance you'll find online an example of someone dealing with a problem quite close to the one you're handling.
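As a sketch of what those toy tutorials usually walk through, and assuming the monthly totals live in a pandas Series called `passengers` (a hypothetical name for the classic AirPassengers data), the stationarity check and differencing step typically look like this:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# `passengers` is assumed to be a pandas Series of monthly totals,
# indexed by date (the classic AirPassengers dataset).

def adf_report(series: pd.Series, label: str) -> None:
    """Run the augmented Dickey-Fuller test and print the p-value."""
    stat, pvalue, *_ = adfuller(series.dropna())
    print(f"{label}: ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")

adf_report(passengers, "raw series")              # usually non-stationary
adf_report(passengers.diff(), "1st difference")   # differencing removes the trend
adf_report(passengers.diff(12), "seasonal diff")  # removes the yearly seasonality
```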

With time series, I found myself looking for days, even weeks, for some sort of clue on how to handle my data.

Also, quite often, even if you don't find much online, I believe you're not dealing with a single time series but with time serieS. For instance, I'm currently working on a problem where I have to handle more than 17,000 time series. And I don't think I'm an isolated case: take a company that sells different products. If it has 100 products, you'll have 100 time series. And a lot of companies don't sell only 100 products. When dealing with this many time series, I promise you'll find yourself facing this problem: "Alright, I have n time series. I want to produce a forecast, but I can't tune n models." So you'll find yourself looking for some automatic approach (I recommend auto_arima from pmdarima in Python, by the way). There are some, but they are mainly solutions made (obviously) for single time series. And even if you use those solutions for multiple time series, there is a huge computational cost. For instance, for around 250 time series, auto_arima took several hours to compute.
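A minimal sketch of that automatic approach with pmdarima, assuming the series are stored in a dictionary of pandas Series keyed by product (the dictionary name, seasonal period, and horizon are all assumptions for the example), and keeping in mind that looping like this gets expensive fast:

```python
import pmdarima as pm

# `series_by_product` is assumed: {product_id: pandas Series of sales}
forecasts = {}
for product_id, series in series_by_product.items():
    # auto_arima searches over (p, d, q) orders for each series;
    # m=12 assumes monthly data with yearly seasonality
    model = pm.auto_arima(series, seasonal=True, m=12,
                          suppress_warnings=True, error_action="ignore")
    forecasts[product_id] = model.predict(n_periods=6)  # 6 steps ahead
```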

To sum up, what makes time series hard to handle, in my opinion, is that they can quickly require a complex machine learning pipeline.

Transformation, forecast, and error additivity

I can't talk about time series without talking about transformations. When dealing with real-world data, there will be a lot of cases where you'll need to transform your data in order to be able to make a forecast. And again: which transformation? Differencing, cumulative sum, logarithm, exponential smoothing, or maybe moving average smoothing? For instance, in my project, even though I later changed my approach, what gave me the best results was differencing the logarithm after scaling my data (with a standard scaler, based on mean and standard deviation). But I also tried a signal-processing approach with a high-frequency filter based on Fourier analysis.
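Here is a minimal sketch of one such chain of transformations (logarithm, then differencing, then standard scaling); the exact order and the choices are just one possibility among many, not the pipeline from my project:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# `series` is assumed to be a strictly positive pandas Series (e.g. sales counts)
log_series = np.log(series)           # stabilises the variance
diffed = log_series.diff().dropna()   # removes the trend

# Standard scaling (zero mean, unit variance) of the transformed values
scaler = StandardScaler()
scaled = scaler.fit_transform(diffed.to_numpy().reshape(-1, 1))

# To return to the original scale after forecasting, every step has to be
# inverted in reverse order: inverse scaling, cumulative sum, then exp.
```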

Well, as you can tell, there are a lot of possibilities, and again, you'll have to choose the right one in order to make a good forecast.

When forecasting, you'll also need to be careful about error additivity. If you transform your data before forecasting, any error you make in the transformed space can turn into a bad surprise once you come back to the original scale, because the inverse transformation changes how that error behaves.
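A tiny numerical illustration of that bad surprise, assuming a log transform was used: a small, constant error on the log scale becomes a multiplicative (and therefore growing) error once you exponentiate back:

```python
import numpy as np

true_values = np.array([100.0, 1_000.0, 10_000.0])
log_forecast = np.log(true_values) + 0.1  # a constant +0.1 error on the log scale

back_transformed = np.exp(log_forecast)
print(back_transformed - true_values)
# On the original scale the error is no longer constant: it is roughly
# +10.5% of each value, so it grows with the magnitude of the series.
```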

Metrics

To end my small review of why time series are so hard and so frustrating, I would like to talk about metrics. When doing classification, testing your results is quite comfortable. I remember a project during which I had to classify objects: I had around 7 metrics to support my tests, and I tested 15 models that each gave me some pretty good results. The difficulty with time series is that it is not a binary task. If your test forecast is exactly the same as your original data, there is a very good chance that your model is overfitting your data.

And how do you evaluate your forecast? You can use metrics like RMSE, MAE, and MAPE, but the hard part is deciding, by combining those metrics, whether a forecast is good, and, hardest of all for me, comparing two models based on these metrics. Well, one more hard task for time series.
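As a small sketch, assuming `y_true` and `y_pred` are numpy arrays of actual and forecast values (made-up numbers for the example), the three metrics themselves are easy to compute; deciding what counts as "good" is the part that stays hard:

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    # undefined when a true value is 0, one of the classic MAPE pitfalls
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

y_true = np.array([120.0, 135.0, 150.0])
y_pred = np.array([110.0, 140.0, 160.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```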

Conclusion

To conclude, I would say that time series are complex and hard-to-handle objects. But if you manage to handle them, they will reveal so much knowledge that it will be worth all the pain you went through. Well, almost, because you'll know what pain really means once you've handled time series, and it can take several lifetimes to recover.
