Store Sales - Time Series Forecasting

Sebastian
9 min read · Mar 14, 2022


In this ‘story’ I want to show my process through this beginner Kaggle competition. I’ve applied some of the main concepts from the book “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” and from Kaggle’s Time Series course.

Kaggle Notebook

Goal of the Competition

In this “getting started” competition, you’ll use time-series forecasting to forecast store sales on data from Corporación Favorita, a large Ecuadorian-based grocery retailer.

Specifically, you’ll build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores. You’ll practice your machine learning skills with an approachable training dataset of dates, store, and item information, promotions, and unit sales.

  1. Look at the big picture
  • Framing the problem: since all the data is already labeled and we are expected to predict numerical values, this is a supervised learning, regression task.
  • Selecting a performance measure: we are given the specific evaluation metric for this competition, the Root Mean Squared Logarithmic Error (RMSLE).
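As a quick sketch of what RMSLE measures (this is not the competition’s scoring code, just an illustration):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error.

    Working on log1p-transformed values means the metric penalizes
    relative errors, so being off by 10 units matters less for an
    item selling thousands than for one selling dozens.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0 for a perfect forecast
```

scikit-learn’s mean_squared_log_error computes the squared version of the same quantity, so its square root matches this function.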

2. Get the data

  • This is one of the most crucial parts of any ML project. Thanks to Kaggle we are provided with six CSV files (train, test, stores, oil, holidays_events, transactions) containing data about products, stores, promotions, sales, oil prices, holidays and more. Our target: sales.
  • We don’t need to set aside a test set; we can start working directly on the data from train.csv.

We start by fetching the training data from its CSV file and creating a pandas DataFrame called train_df:

import pandas as pd

train_df = pd.read_csv('../input/store-sales-time-series-forecasting/train.csv', parse_dates=['date'], infer_datetime_format=True)

3. Take a Quick Look at the Data Structure

For now we will just consider the data from train.csv to get a basic understanding of its structure, but the data provided in the other CSV files will also be useful and necessary.

Let’s start taking a look at the top five rows of the training dataset using the head() method:

Top five rows in the train dataset

Each row represents the sales of a specific product (‘family’) in a specific store (‘store_nbr’) on a single day (‘date’). The entire dataset comprises almost 1800 series recording store sales across a variety of product families from 2013 into 2017.

Let’s get a quick description of the data (total number of rows, each attribute’s type)

Train dataset information

All attributes are numerical except family (and date, whose type is datetime). The family attribute has type object, in this case text, and its values repeat, which means they are categorical values for the products: “automotive”, “baby care”, “beauty”, etc. We will later have to encode these attributes so the ML models can learn from them.

First 10 family attributes and their counts

There are 33 different families with 90,936 rows each.
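These counts are a couple of pandas one-liners. A sketch on a tiny hypothetical stand-in for train_df (the real frame has about 3 million rows):

```python
import pandas as pd

# Tiny stand-in for train.csv: 2 stores x 2 families over 2 days
train_df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01"] * 4 + ["2017-01-02"] * 4),
    "store_nbr": [1, 1, 2, 2] * 2,
    "family": ["AUTOMOTIVE", "BEAUTY"] * 4,
    "sales": [0.0, 3.0, 1.0, 5.0, 0.0, 4.0, 2.0, 6.0],
})

n_families = train_df["family"].nunique()              # 33 on the real data
rows_per_family = train_df["family"].value_counts()    # 90936 each there
n_series = train_df.groupby(["store_nbr", "family"]).ngroups  # 54 * 33 = 1782
print(n_families, n_series)  # 2 4 for this toy frame
```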

4. Discover and visualize the data to gain insights. Prepare the data and train a simple Linear Regression model

So far we have only taken a quick glance at the data to get a general understanding of it. Now the goal is to go a little bit more in depth.

To begin with, we create a copy of the train_df dataframe and structure it with its date, store number and family as indexes:

Store sales dataframe
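The restructuring itself is a single set_index call; a sketch on a toy frame with the same columns as train_df:

```python
import pandas as pd

# Toy stand-in for train_df
train_df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-02", "2017-01-01", "2017-01-02", "2017-01-01"]),
    "store_nbr": [1, 1, 2, 2],
    "family": ["BEAUTY"] * 4,
    "sales": [4.0, 3.0, 6.0, 5.0],
})

store_sales = (
    train_df.copy()
    .set_index(["store_nbr", "family", "date"])  # hierarchical index
    .sort_index()
)
```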

Then we compute the average sales grouped by day and store the result as a series:

Average sales series
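With the date in the index, the averaging is one groupby; sketched on a toy store_sales frame:

```python
import pandas as pd

# Toy stand-in for the indexed store_sales frame: two stores, one family, two days
store_sales = pd.DataFrame(
    {"sales": [3.0, 5.0, 4.0, 6.0]},
    index=pd.MultiIndex.from_tuples(
        [
            (1, "BEAUTY", pd.Timestamp("2017-01-01")),
            (2, "BEAUTY", pd.Timestamp("2017-01-01")),
            (1, "BEAUTY", pd.Timestamp("2017-01-02")),
            (2, "BEAUTY", pd.Timestamp("2017-01-02")),
        ],
        names=["store_nbr", "family", "date"],
    ),
)

# Grouping by the "date" index level averages across every store and family
average_sales = store_sales.groupby("date")["sales"].mean()
print(average_sales.tolist())  # [4.0, 5.0]
```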

Let’s see if we can discover some kind of trend in the average sales for the period between 2013–2017.

Time-step Features

We add a time-step feature on the average_sales dataframe and fit a Linear Regression Model. We then plot the fitted values over time:

Trend line of average sales. 2013–2017

Time-step features let you model time dependence. In this sales series, we can predict that sales later in the month are generally higher than sales earlier in the month.
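The time-step feature is just a counter 0, 1, 2, … aligned with the dates. A sketch with a short hypothetical series in place of average_sales:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical short series standing in for average_sales (2013-2017 daily)
average_sales = pd.Series(
    [1.0, 2.1, 2.9, 4.2, 5.0],
    index=pd.date_range("2017-01-01", periods=5, name="date"),
    name="sales",
)

df = average_sales.to_frame()
df["time"] = np.arange(len(df))  # the time-step ("time dummy") feature

model = LinearRegression().fit(df[["time"]], df["sales"])
trend = pd.Series(model.predict(df[["time"]]), index=df.index)  # fitted values to plot
```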

Lag features

We now replace the time-step feature with a lag feature and again fit a Linear Regression model. Plot it:

Plot sales/lag_1 data and a linear regression model fit.

Lag features let you model serial dependence: a time series has serial dependence when an observation can be predicted from previous observations. You can see from the lag plot that sales on one day (sales) are correlated with sales from the previous day (lag_1). When you see a relationship like this, you know a lag feature will be useful; in this series, high sales on one day usually mean high sales the next day.
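A lag feature is simply the series shifted forward one step; a sketch on a short hypothetical series:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical short series in place of average_sales
average_sales = pd.Series(
    [1.0, 2.0, 2.0, 3.0, 3.0, 4.0],
    index=pd.date_range("2017-01-01", periods=6, name="date"),
    name="sales",
)

df = average_sales.to_frame()
df["lag_1"] = df["sales"].shift(1)  # yesterday's sales
df = df.dropna()                    # the first day has no previous day

model = LinearRegression().fit(df[["lag_1"]], df["sales"])
```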

Adapting machine learning algorithms to time series problems is largely about feature engineering with the time index and lags. We will use linear regression for its simplicity, but these features will be useful whichever algorithm you choose for your forecasting task.

The best time series models will usually include some combination of time-step features and lag features.

Moving Average Plots

To see what kind of trend a time series might have, we can use a moving average plot. The idea is to smooth out any short-term fluctuations in the series so that only long-term changes remain.

Let’s make a moving average plot to see what kind of trend this series has. Since this series has daily observations, let’s choose a window of 365 days to smooth over any short-term changes within the year. First use the rolling method to begin a windowed computation and then follow this by the mean method to compute the average over the window.

Moving average plot of sales
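The computation is a rolling window followed by a mean; a sketch with a small window on toy data (the article uses window=365 on average_sales):

```python
import pandas as pd

s = pd.Series(
    [float(v) for v in range(10)],
    index=pd.date_range("2017-01-01", periods=10),
)

moving_average = s.rolling(
    window=5,        # 365 in the article: a full year of daily observations
    center=True,     # puts the average at the middle of the window
    min_periods=3,   # roughly half the window, so the ends aren't all NaN
).mean()
print(moving_average.iloc[4])  # 4.0: mean of 2, 3, 4, 5, 6
```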

As we can see, the trend of average sales appears to be roughly cubic.

Trend Feature

We’ll use DeterministicProcess to create a feature set for a cubic trend model, along with features for a 90-day forecast. To make a forecast, we apply our model to “out of sample” features, that is, to times outside the observation period of the training data. Here’s how we could make the 90-day forecast:
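A sketch of that process with statsmodels’ DeterministicProcess, using a toy cubic series in place of average_sales:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import DeterministicProcess

# Toy target that really is cubic, standing in for average_sales
y = pd.Series(
    [float(t) ** 3 for t in range(30)],
    index=pd.date_range("2017-01-01", periods=30, name="date"),
)

dp = DeterministicProcess(
    index=y.index,
    constant=True,  # intercept column
    order=3,        # cubic trend: time, time**2, time**3
    drop=True,      # drop terms if they would be collinear
)
X = dp.in_sample()                   # features over the training dates
X_fore = dp.out_of_sample(steps=90)  # features for the 90-day forecast

model = LinearRegression(fit_intercept=False).fit(X, y)  # const is already in X
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)
```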

You can see a plot of the result:

The trend discovered by our LinearRegression model is almost identical to the moving average plot, which suggests that a cubic trend was the right decision in this case.

These trend models we learned about turn out to be useful for a number of reasons. Besides acting as a baseline or starting point for more sophisticated models, we can also use them as a component in a “hybrid model” with algorithms unable to learn trends (like XGBoost and random forests).

Seasonality

We say that a time series exhibits seasonality whenever there is a regular, periodic change in the mean of the series. Seasonal changes generally follow the clock and calendar — repetitions over a day, a week, or a year are common. Seasonality is often driven by the cycles of the natural world over days and years or by conventions of social behavior surrounding dates and times.

We will learn two kinds of features that model seasonality. The first kind, indicators, is best for a season with few observations, like a weekly season of daily observations. The second kind, Fourier features, is best for a season with many observations, like an annual season of daily observations.
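Both kinds of features can be built by hand, which makes them easy to understand. A sketch (statsmodels’ CalendarFourier and DeterministicProcess can generate equivalent columns):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2017-01-01", periods=365, name="date")

# Indicators: one 0/1 column per day of the week (a season with few observations)
X_week = pd.get_dummies(pd.Series(idx.dayofweek, index=idx).astype(str), prefix="dow")

# Fourier features: sin/cos pairs for the annual season (many observations)
def fourier_features(index, period, order):
    time = np.arange(len(index), dtype=np.float64)
    k = 2.0 * np.pi * time / period
    feats = {}
    for i in range(1, order + 1):
        feats[f"sin_{i}"] = np.sin(i * k)
        feats[f"cos_{i}"] = np.cos(i * k)
    return pd.DataFrame(feats, index=index)

X_annual = fourier_features(idx, period=365.25, order=12)  # 12 pairs -> 24 columns
```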

We will include in our analysis the data from the holidays_events.csv file, which contains information about holidays and important events, with metadata:

Holidays_events Dataframe

Seasonal Plots

A seasonal plot shows segments of the time series plotted against some common period, the period being the “season” you want to observe.

We will examine the following seasonal plot to try to discover seasonal patterns:

Seasonal plot of weekly average sales

There is a clear weekly seasonal pattern in this series, with sales higher on weekends.

Periodograms

The periodogram tells you the strength of the frequencies in a time series.

Let’s also take a look at the following periodogram.

Periodogram of average sales over the year
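A periodogram can be computed with scipy; a sketch on a toy series with a built-in weekly cycle (fs=365.25 expresses frequencies in cycles per year, so the weekly peak lands near 52):

```python
import numpy as np
from scipy.signal import periodogram

# Toy daily series with a weekly cycle
t = np.arange(365)
y = 10.0 + 3.0 * np.sin(2.0 * np.pi * t / 7.0)

# fs=365.25 expresses frequencies in cycles per year
freqs, spectrum = periodogram(y, fs=365.25, detrend="linear", window="boxcar")

peak = freqs[np.argmax(spectrum)]
print(peak)  # near 52: the weekly frequency, 365.25 / 7 cycles per year
```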

From left to right, the periodogram drops off after the monthly frequency (twelve times a year). That’s why we’ll choose 12 Fourier pairs to model the annual season. We’ll ignore the weekly frequency, since it’s better modeled with indicators.

Both the seasonal plot and the periodogram suggest a strong weekly seasonality. From the periodogram, it appears there may be some monthly and biweekly components as well. (In fact, the notes to the Store Sales dataset say wages in the public sector are paid out biweekly, on the 15th and last day of the month — a possible origin for these seasons).

Seasonal Features

We’ll create our seasonal features using DeterministicProcess, the same utility we used to create trend features.

We then fit a Linear Regression model and finally make a 90-day forecast to see how our model extrapolates beyond the training data:

Deseasonalizing or Detrending

Removing from a series its trend or seasons is called detrending or deseasonalizing the series.
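In code, deseasonalizing amounts to subtracting the seasonal model’s fitted values from the series; a sketch with a toy two-day “season”:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy series: flat level of 10 with +4 every second day
y = pd.Series(
    [10.0, 14.0, 10.0, 14.0, 10.0, 14.0],
    index=pd.date_range("2017-01-01", periods=6, name="date"),
)

# A single indicator feature is enough for this period-2 season
X = pd.DataFrame({"second_day": [0, 1, 0, 1, 0, 1]}, index=y.index)

model = LinearRegression().fit(X, y)
y_season = pd.Series(model.predict(X), index=y.index)

y_deseason = y - y_season  # what the seasonal model couldn't explain
```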

Look at the periodogram of the deseasonalized series:

Holidays Dataframe

We create a dataframe with data about Regional and National Ecuadorian Holidays from 2017.

Holidays Dataframe (2017)
Plot of deseasonalized average sales and holidays (2017)

From a plot of the deseasonalized Average Sales, it appears these holidays could have some predictive power.

Create Holiday Features

We use the OneHotEncoder method to create holiday features and then join them to the training data we already had.

Now we fit the seasonal model with the addition of the holiday features expecting the fitted values to improve. Plot it:

First Submission

Let’s create a seasonal model of the kind we’ve learned about for the full store sales dataset with all 1800 time series:
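One way to fit all series at once is to pivot to a wide frame with one column per series and let LinearRegression handle the 2-D target. A sketch on two hypothetical series:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Two hypothetical series in wide form; the real frame has ~1800 such columns
y = pd.DataFrame(
    {"store1_BEAUTY": [1.0, 2.0, 3.0, 4.0], "store2_BEAUTY": [2.0, 4.0, 6.0, 8.0]},
    index=pd.date_range("2017-08-01", periods=4, name="date"),
)
X = pd.DataFrame({"time": range(len(y))}, index=y.index)

# scikit-learn fits a 2-D target column by column, one coefficient set per series
model = LinearRegression().fit(X, y)
y_pred = pd.DataFrame(model.predict(X), index=y.index, columns=y.columns)
```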

Let’s see some of its predictions:

Finally, let’s load the test data, create a feature set for the forecast period, and then create the submission file submission.csv:
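The last step reshapes the wide forecast back into the long format test.csv uses; a sketch on a toy forecast (the real ids come from test.csv):

```python
import pandas as pd

# Toy wide forecast; the real columns are the (store_nbr, family) pairs
y_fore = pd.DataFrame(
    [[5.0, -1.0], [6.0, 12.0]],
    index=pd.date_range("2017-08-16", periods=2, name="date"),
    columns=pd.MultiIndex.from_tuples(
        [(1, "BEAUTY"), (2, "BEAUTY")], names=["store_nbr", "family"]
    ),
)

# Long format with one row per (date, store, family), as test.csv expects
submission = (
    y_fore.stack(["store_nbr", "family"])
    .rename("sales")
    .reset_index()
)
submission["sales"] = submission["sales"].clip(lower=0.0)  # no negative sales

# The id column would be taken from test.csv before saving:
# submission.to_csv("submission.csv", index=False)
```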

We now have to follow some steps in order to successfully submit the results to the competition!

Conclusion

We have analysed the structure of the data, gained some important insights about trend and seasonality, and for both of them created features to train basic Linear Regression models.

There’s still more we can do with time series to improve our forecasts. In the next ‘story’ we’ll learn how to use time series themselves as features. Using time series as inputs to a forecast lets us model another component often found in series: cycles.

We will also improve our current models by combining what we’ve got to create “hybrid models”.
