Filling in gaps in time series data

germayne
eat-pred-love
Published in
3 min readJul 31, 2018

In my current job, I deal with lots of time-series based data. As opposed to cross-sectional data, the temporal ordering of each row is very important. This also makes working with time-series based data more tedious. ( you have to take extra care, else you might make some mistakes. But that is a story for another day)

Today, I will want to discuss on a particular problem: gaps in time-series data. Imagine you have a data set on 1000 stores across a time period of 2014 to 2017 monthly data. There will be scenarios where some stores, say store 1, only recorded data in certain months. For the sake of this example, let us assume that store 1 has missing data points for some months in 2016 and 2017 while everything is intact. At the same time, store 2 has missing data points for some months in 2014 and 2015. Essentially, all 1000 stores do not have a full 36 monthly data.

Typical missing data which you do not find in Kaggle

The problem arises when you want to create some time features, and a basic one that comes to mind will be ‘’previous month’’. A simple way to do so will be applying a lag of 1 period grouped by the stores. However, since the data across the 1000 stores are having different periods, doing a lag transformation will result in a wrong output.

A simple solution is to do a pivot on the data and revert back to the original form. Doing so will automatically generate the missing period. If you are a R user, it will be tidyr::separate and tidyr::gather while if you are a python user, it will be the .pivot and then .melt method from pandas. Below is an example in R. You can apply the same logic for pandas. (In fact, you can refer to another article that I have regarding pandas pivot)

3 line R solution

What if all the stores have incomplete data?

So the above scenario works out if one of the stores has the complete date. For example, if monthly, then one of the stores should have 12 months of data. If it is daily, then one of the stores should have 365? rows of daily data. Otherwise, one simple hack is to append a “mock data”. With this mock data, you can generate with the same headers (as your main data frame) as well as generate a sequence of date. Below I will illustrate this example for daily data.

if you are on python, You can do this similarly:

Once you append it, continue pivoting (tidyr on r/ pivot on python) and undo the pivot and you should have all the respective stores with all the dates as well as NA for the sales. Now you can either leave it as such or replace them with 0.

So I hoped that these little tips helped you in manipulating these pesky time series data!

--

--