Regression Project (Store Sales — Time Series Forecasting)

Emmanuel Ikogho
7 min read · Feb 22, 2023

1.0 Introduction

In this project, we will predict store sales using data from Corporation Favorita, a large grocery retailer based in Ecuador. This is a time series forecasting problem.

Specifically, we are to build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.


1.1 The Data

The training data includes dates, store, and product information, whether that item was being promoted, as well as the sales numbers. Additional files include supplementary information that may be useful in building our models.

Overview of the datasets with brief description

The test data has the same features as the training data; we will predict the target sales for the dates in this file.

  • The dates in the test data are for the 15 days after the last date in the training data.

2.0 Ask Stage

Here, we set out the questions we intend to answer through the analysis. The following hypothesis and questions were stated to guide the analyses.

2.1 Hypothesis

Null Hypothesis: Series is non-stationary

Alternative Hypothesis: Series is stationary

A time series is said to be stationary when its properties are independent of time, meaning its values are not affected by the time at which they occurred.

Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times.

I’ll explain more as you follow along.

Test for Stationarity
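As a rough illustration of how such a test can be run (the statsmodels `adfuller` function and a `train` DataFrame with a `sales` column are assumptions here, not the notebook's exact code), the Augmented Dickey-Fuller test gives a p-value we can compare against 0.05:

```python
# A minimal sketch of an Augmented Dickey-Fuller (ADF) stationarity test.
# Assumes a pandas DataFrame `train` with a numeric `sales` column.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def test_stationarity(series: pd.Series, alpha: float = 0.05) -> None:
    """Run the ADF test and report whether we can reject the null of non-stationarity."""
    adf_stat, p_value = adfuller(series.dropna())[:2]
    print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
    if p_value < alpha:
        print("Reject the null hypothesis: the series looks stationary.")
    else:
        print("Fail to reject the null: the series looks non-stationary.")

# test_stationarity(train["sales"])
```

If the p-value falls below 0.05, we reject the null hypothesis and treat the series as stationary.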

2.2 Questions

For our Exploratory Data Analysis (EDA), we must ask the right questions:

1. Is the train dataset complete (does it have all the required dates)?

2. Which dates have the lowest and highest sales for each year?

3. Did the earthquake impact sales?

4. Are certain groups of stores selling more products? (Cluster, city, state, type)

5. Are sales affected by promotions, oil prices and holidays?

6. What analysis can we get from the date and its extractable features?

7. What is the difference between RMSLE, RMSE and MSE (or why is the MAE greater than all of them)?

ADDITIONAL QUESTIONS

8. What is the trend of sales over time?

9. What is the trend of transactions over time?

10. Highest and lowest performing stores

11. Highest performing family of products

3.0 Data Preparation and Processing

At this stage, we organize the data to make it fit for analysis. Cleanliness and consistency of data are the objectives here.

3.1 Notes from Previewing the DataFrames

  • The dates in the train data run from January 2013 till October 2017
  • Sales has a positive correlation with onpromotion, the strongest among our features, so we’ll focus on these two columns
  • No missing values in either our train or test data
  • The family column has too many categories. This could complicate our machine learning model later on, so we might group them into fewer, broader categories later

3.2 Cleaning the Data

Given that this article is a summary, I will focus on the major operations performed on the DataFrames. The detailed functions can be found in the notebook, which is linked at the end of the article; a rough pandas sketch of these steps follows the list below.

  1. Converted date columns from object to datetime
  2. Filled in missing dates in train, transactions, holidays, oil
  3. Set the date as index
  4. Dropped unnecessary columns in holidays, train and test
  5. In holidays, deleted rows where transferred was true
  6. Renamed columns for easier understanding
  7. Merged the datasets according to what will help predict sales
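The file and column names below (train.csv, holidays_events.csv, oil.csv, the transferred and type columns) are taken from the dataset description and are assumptions rather than the notebook's exact code:

```python
# A hedged sketch of the main cleaning steps; file and column names are assumptions.
import pandas as pd

# 1. Convert date columns from object to datetime while loading
train = pd.read_csv("train.csv", parse_dates=["date"])
holidays = pd.read_csv("holidays_events.csv", parse_dates=["date"])
oil = pd.read_csv("oil.csv", parse_dates=["date"])

# 2. Fill in missing calendar dates by reindexing onto a full daily range
full_range = pd.date_range(train["date"].min(), train["date"].max(), freq="D")
oil = (oil.set_index("date")
          .reindex(full_range)
          .rename_axis("date")
          .reset_index())

# 5. Drop holiday rows where `transferred` is true
holidays = holidays[~holidays["transferred"]]

# 6. Rename columns for easier understanding
holidays = holidays.rename(columns={"type": "holiday_type"})

# 7. Merge the supplementary tables into the training data, then
# 3. set the date as the index
merged = (train.merge(oil, on="date", how="left")
               .merge(holidays, on="date", how="left")
               .set_index("date"))
```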

4.0 Answering the Questions

Here, I combine the “Analyse” and “Share” stages of the data analysis process through the code and visualisations.

4.1. Is the train dataset complete (has all the required dates)?

No, the train data has missing dates.

Missing dates in train data
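A quick sketch of how such a completeness check can be done (the file and column names are assumptions): build the full expected daily range and compare it with the dates actually present.

```python
# Sketch: find calendar dates that are missing from the train data.
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["date"])
expected = pd.date_range(train["date"].min(), train["date"].max(), freq="D")
missing = expected.difference(pd.DatetimeIndex(train["date"].unique()))
print(f"{len(missing)} missing dates, e.g. {missing[:5].tolist()}")
```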

4.2. Which dates have the lowest and highest sales for each year?

‘2016–05–02’, followed by ‘2016–10–07’, had the highest sales.

Dates with highest sales
Dates with lowest sales

4.3. Did the earthquake impact sales?

The earthquake had no overall effect on sales.

Impact of earthquake on sales

4.4. Are certain groups of stores selling more products? (Cluster, city, state, type)

Grouping by cluster, some clusters of stores have sold noticeably more products than others.

Sales by City
Sales by Cluster
Sales by Type
Sales by State

4.5. Are sales affected by promotions, oil prices and holidays?

Sales has a weak correlation with all three, but its correlation with onpromotion is the highest (0.43).

Sales Correlation Matrix
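As an illustrative sketch (continuing from the merged DataFrame sketched earlier; the onpromotion, oil_price and is_holiday column names are assumptions), the correlation of sales with these drivers can be read straight off a correlation matrix:

```python
# Sketch: correlation of sales with promotions, oil prices and holidays.
cols = ["sales", "onpromotion", "oil_price", "is_holiday"]
corr = merged[cols].corr()
print(corr["sales"].sort_values(ascending=False))
```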

4.6. What analysis can we get from the date and its extractable features?

Feature Engineering

4.7. What is the difference between RMSLE, RMSE and MSE (or why is the MAE greater than all of them)?
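The notebook covers this in detail; as a quick, hedged illustration on made-up numbers (not the article's actual results), the four metrics can be computed side by side. MSE squares the errors, RMSE brings them back to the original units, RMSLE works on log1p-scaled values so it measures relative error, and MAE averages the absolute errors.

```python
# Illustration only: MAE, MSE, RMSE and RMSLE on made-up numbers.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error)

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 180.0, 330.0])

mae = mean_absolute_error(y_true, y_pred)                 # average absolute error
mse = mean_squared_error(y_true, y_pred)                  # average squared error
rmse = np.sqrt(mse)                                       # back to the original units
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))   # relative (log-scale) error

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  RMSLE={rmsle:.4f}")
```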

ADDITIONAL QUESTIONS

4.8. What is the trend of sales over time?

Sales vs Time

4.9. What is the trend of transactions over time?

Transactions over time
Transactions weekly trend

4.10. Highest and lowest performing stores

Top Stores by Sales & Transactions

4.11. Highest performing family of products

Top Family of Products by Sales

5.0 Feature Processing & Engineering

In this section, we clean and process the dataset and create new features.

5.1 Impute Missing Values

I filled missing holiday rows with the same value as ‘work days’, since non-holidays are work days.

5.2 New Features Creation

I used the ‘get_date_features’ function I created earlier to create new features to help our models understand the dataset better.
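The exact function lives in the notebook; a hypothetical sketch of what a `get_date_features` helper typically looks like (the specific features below are assumptions) is:

```python
# Hypothetical sketch of a date-feature helper; the real function may differ.
import pandas as pd

def get_date_features(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    df = df.copy()
    dt = pd.to_datetime(df[date_col])
    df["year"] = dt.dt.year
    df["month"] = dt.dt.month
    df["day"] = dt.dt.day
    df["day_of_week"] = dt.dt.dayofweek
    df["week_of_year"] = dt.dt.isocalendar().week.astype(int)
    df["is_weekend"] = (dt.dt.dayofweek >= 5).astype(int)
    df["is_month_end"] = dt.dt.is_month_end.astype(int)
    return df
```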

5.3 Feature Encoding & Scaling

I used the LabelEncoder to turn categorical columns into numbers that ML models can understand. In this case, I skipped scaling because the models I’ll be using don’t require it.
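A minimal sketch of that encoding step (the column names are assumptions) might look like this; note that in practice the same fitted encoders should also be applied to the test set.

```python
# Sketch: label-encode categorical columns so tree-based models can use them.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(df: pd.DataFrame, cols) -> pd.DataFrame:
    df = df.copy()
    for col in cols:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df

# train_processed = encode_categoricals(train_features, ["family", "city", "state", "store_type"])
```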

5.4 Dataset Splitting

I split my train data into a train set and an evaluation set (3,000 rows).
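Because this is a time series, the hold-out set should be the most recent rows rather than a random sample; a sketch of that split (assuming a date-sorted, processed DataFrame called `train_processed`) is:

```python
# Sketch: time-ordered split that holds out the last 3,000 rows for evaluation.
eval_size = 3000
train_processed = train_processed.sort_index()   # date index, oldest to newest
train_set = train_processed.iloc[:-eval_size]
eval_set = train_processed.iloc[-eval_size:]
```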

6.0 Machine Learning Modeling

6.0.1 Traditional ML Models

I tried 6 different traditional ML models. Here are the results for the train set.

Model Scores for our train set

Looking at the RMSLE scores, DecisionTree gives the lowest error so far.
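For illustration, a loop like the one below can score several regressors by RMSLE on the train/evaluation split; the exact six models used here are in the notebook, so the set shown is an assumption.

```python
# Sketch: compare a few regressors by RMSLE (model list is illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}

X_train, y_train = train_set.drop(columns=["sales"]), train_set["sales"]
X_eval, y_eval = eval_set.drop(columns=["sales"]), eval_set["sales"]

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = np.clip(model.predict(X_eval), 0, None)   # RMSLE needs non-negative values
    rmsle = np.sqrt(mean_squared_log_error(y_eval, preds))
    print(f"{name}: RMSLE = {rmsle:.3f}")
```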

6.0.2 Model Evaluation (Traditional Models)

We went further to test it on our evaluation set. See the scores below.

DecisionTree Scores for evaluation set

We get an RMSLE of 0.75. Let’s use backtests to test our model again. Each backtest consists of a train and a test set, and each test set has 15 days of data (similar to the test set provided). Below is a rough sketch of how such a backtest might be set up, followed by the scores from each backtest.
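The cutoff dates, model and column names in this sketch are assumptions for illustration:

```python
# Sketch: backtest by training before each cutoff and scoring the next 15 days.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_log_error

def backtest(df: pd.DataFrame, cutoffs, horizon_days: int = 15):
    scores = []
    for cutoff in pd.to_datetime(cutoffs):
        train_part = df[df.index < cutoff]
        test_part = df[(df.index >= cutoff) &
                       (df.index < cutoff + pd.Timedelta(days=horizon_days))]
        model = DecisionTreeRegressor(random_state=42)
        model.fit(train_part.drop(columns=["sales"]), train_part["sales"])
        preds = np.clip(model.predict(test_part.drop(columns=["sales"])), 0, None)
        scores.append(np.sqrt(mean_squared_log_error(test_part["sales"], preds)))
    return scores

# backtest(train_processed, ["2017-07-01", "2017-07-16", "2017-07-31"])
```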

Backtest RMSLE Scores

Our backtest gives us RMSLE scores of 0.58, 0.58 and 0.59. This shows that none of the traditional models we trained achieve a good RMSLE score (<0.4), which leads us to classical time series models.

6.0.3 Classical Time Series Models (statsmodels)

Our statsmodels models train on the sales column only and then predict sales for future dates. That being said, we’ll first resample our train data to its daily mean.

We resample because, since we are using only the sales column, our statsmodels models won’t do well with the many zeros in it.

Resampled our Train data by Daily Mean
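A sketch of this resampling plus an autoregressive fit, using statsmodels’ AutoReg (the lag order of 7 is an assumption, and the train data is assumed to have a datetime index and a sales column), could look like:

```python
# Sketch: resample sales to the daily mean and fit an AR model with statsmodels.
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

# Assumes `train` has a datetime index and a `sales` column
daily_sales = train["sales"].resample("D").mean()

ar_model = AutoReg(daily_sales, lags=7).fit()
forecast = ar_model.predict(start=len(daily_sales),
                            end=len(daily_sales) + 14)   # the next 15 days
print(forecast.head())
```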

Here are the scores for the train set. The AR model gives the lowest error.

statsmodels Error Scores

6.0.4 Model Evaluation (statsmodels)

First we make predictions on our evaluation set and check error scores.

RMSLE of evaluation set

We get an RMSLE of 0.09, which is very good, but I’m not fully satisfied, so I also did a backtest to be sure.

RMSLE Scores for our Backtest

Our backtest gives RMSLE scores of 0.08, 0.05 & 0.11. Now I’m sure that my AR model is the best. Finally, we’ll make predictions on our test data.

6.2 Predict on an Unknown Dataset (Test Set)

In this part, we prepare our test set the same way we prepared our train data. Then we use our AR model to predict the unknown sales. We then save our results as a CSV file called submission.
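A minimal sketch of that last step (assuming a prepared `test` DataFrame with an `id` column and an array of per-row predictions called `test_sales_predictions`; how the AR forecast is mapped to each test row is in the notebook) is:

```python
# Sketch: save predictions in a submission CSV (column names follow the
# competition format and are assumptions).
import pandas as pd

submission = pd.DataFrame({"id": test["id"], "sales": test_sales_predictions})
submission.to_csv("submission.csv", index=False)
```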

submission file screenshot

7.0 Conclusion

So far, we have trained an efficient model that can predict future store sales. Find below a link to all the code on GitHub.


Emmanuel Ikogho

I'm a Data Scientist, currently training with Azubi Africa to advance my Data Career.