Forecasting Store Sales Using Regression Models

Aliyyah
16 min read · Sep 17, 2023


Introduction

Accurate prediction of store sales is crucial because it serves as a foundational element for many aspects of a business’s operations, particularly financial planning and budget management. Reliable forecasts empower businesses to allocate resources effectively, plan expenditures, and make well-informed financial decisions, contributing to financial stability and strategic growth.

Project Structure

This analysis will adhere to the CRISP-DM (Cross Industry Standard Process for Data Mining) framework.

Business Understanding

This project involves predicting store sales using data obtained from Corporation Favorita, a prominent grocery retailer based in Ecuador. The primary objective involves developing a robust and accurate time series forecasting model that predicts store sales for a wide range of products across Favorita stores.

Data Understanding

train.csv

The training dataset consists of time series data that includes features like store_nbr, family, onpromotion, and the target variable, which is sales.

  • store_nbr (store number) identifies the store at which the products are sold.
  • family identifies the type of product sold.
  • sales gives the total sales for a product family at a particular store at a given date.
  • onpromotion gives the total number of items in a product family that were being promoted at a store on a given date.

test.csv

The test data has the same features as the training data. We predicted the target variable sales for the dates in this file.

The dates in the test data are for the 15 days after the last date in the training data.

transaction.csv

It contains the date, store_nbr, and the number of transactions made at that store on that date.

sample_submission.csv

A sample submission file in the correct format.

stores.csv

The store data includes the city, state, type, and cluster.

The cluster is a grouping of similar stores.

oil.csv

It contains the date and the dcoilwtico column. The “dcoilwtico” values represent the daily price of West Texas Intermediate (WTI) crude oil, an important index for tracking and analyzing trends in the oil market. Ecuador is an oil-dependent country, and its economic health is highly vulnerable to shocks in oil prices.

holidays_events.csv

It contains the date, type, locale, locale_name, description, and the transferred column. The “transferred” column in the data indicates whether a holiday is observed on a different date than its actual or traditional date.

Hypothesis

Null Hypothesis: The intensity of promotion (onpromotion) does not have a significant impact on the average sales of products.

Alternative Hypothesis: The intensity of promotion (onpromotion) has a significant impact on the average sales of products.

Questions

  1. Which dates have the lowest and highest sales for each year?
  2. Which stores are among the top 10 in terms of total sales?
  3. Did the sales data show any noticeable changes in sales patterns around the time of the 2016 earthquake?
  4. What are the five most frequently purchased product families?
  5. Is there a relationship between transactions and sales?
  6. Is there any association between oil prices and sales?

Exploratory Data Analysis

Here, we assessed each dataset separately and provided answers to the questions. Our primary emphasis was on the train dataset as it is the most important part of our predictive analysis.

After exploring each dataset with the .info method, we saw that each dataset’s attributes were appropriately assigned to their respective data types. With the exception of the oil dataset, which contained 43 missing values, all other datasets were complete and did not have any missing values.

Exploring the transaction dataset

Transaction time series

We observed an increase in transaction activity at the start of each year, which may be attributed to seasonal trends in the data. To delve deeper into this, we resampled the data for a more thorough analysis. Resampling helps us find and understand recurring patterns or trends in time series data.
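As a minimal sketch, this resampling step can be done with pandas; the file name and column names below follow the dataset description above and may differ from the notebook’s exact code:

```python
import pandas as pd

# Aggregate daily transactions to monthly totals.
transactions = pd.read_csv("transaction.csv", parse_dates=["date"])
monthly = (
    transactions.set_index("date")["transactions"]
    .resample("M")   # calendar-month buckets, labelled by month-end dates
    .sum()
)
print(monthly.head())
```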

Monthly transactions

After resampling the dataset by month, we discovered a consistent surge in transactions at the end of December every year.

Exploring the holiday dataset

Here is an overview of the holiday dataset
Count of holiday types
Count of locale
Distribution of the transferred column

Exploring the oil dataset

For the missing values in the oil dataset, we filled them using the backfill method. The backfill method takes the next known value (the first observed data point after a gap) and propagates it backward to fill in the missing values.
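A minimal sketch of this step, assuming the oil file and column names from the data description:

```python
import pandas as pd

# Fill gaps in the oil price series with the next observed value (backfill).
oil = pd.read_csv("oil.csv", parse_dates=["date"])
oil["dcoilwtico"] = oil["dcoilwtico"].bfill()
```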

Oil time series

We also noticed a decrease in the oil price from 2014 to 2017.

Exploring the train dataset

Multivariate analysis

Correlation heatmap

Here, we observed that there was a moderate correlation of 0.43 between the sales and onpromotion columns. This suggests that as one of the variables increases, the other tends to increase. The other columns had low correlations.

After exploring each dataset, we concatenated the transaction, oil, train, and holidays datasets. To ensure consistency and assess the impact of various factors on sales within the same timeframe, we filtered these datasets to cover the period from January 1, 2013, to August 15, 2017, aligning them with the date range of our train dataset which contains our target variable, sales. Also, we excluded the ‘store’ dataset from our analysis because our objective revolved around forecasting unit sales across all Favorita stores rather than predicting sales for specific products within individual stores.

Hypothesis Testing

For this test, we used the t-test. The t-test is a statistical hypothesis test used to determine if there is a significant difference between the means of two groups in a dataset.
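A sketch of how such a test can be run with scipy; grouping rows into promoted versus non-promoted days is an assumption here, and the article’s exact grouping may differ:

```python
import pandas as pd
from scipy import stats

train = pd.read_csv("train.csv", parse_dates=["date"])

# Hypothetical grouping: rows with at least one promoted item vs. none.
promoted = train.loc[train["onpromotion"] > 0, "sales"]
not_promoted = train.loc[train["onpromotion"] == 0, "sales"]

# Welch's t-test (does not assume equal variances between the groups).
t_stat, p_value = stats.ttest_ind(promoted, not_promoted, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```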

Hypothesis test result

From this result, we saw that the p-value was less than 0.05, so we rejected the null hypothesis: the intensity of promotion has a significant impact on average sales.

Answering Questions

  1. Which dates have the lowest and highest sales for each year?

The dates with the lowest sales for each year recorded a sales value of 0.0. This could be due to various factors such as holidays, special events, or other circumstances that led to lower sales activity on those specific dates.

The dates with the highest sales information is valuable in helping us understand trends in sales performance and identify impactful events or promotions that led to these high sales days.

2. Which stores are among the top 10 in terms of total sales?

Top ten stores by total sales

Store number 44 stands out with the highest total sales. Following closely is store number 45, which achieved total sales of approximately 54,498,010. Store number 50 recorded the lowest total sales among the top ten. This data provides valuable insight into the distribution of sales among the top-performing stores.

3. Did the sales data show any noticeable changes in sales patterns around the time of the 2016 earthquake?

Sales Patterns Before and After the 2016 Earthquake

The sales data showed noticeable changes in sales patterns after the 2016 earthquake: there was a sudden increase in sales immediately after the event. This increase can be attributed to the aftermath of the earthquake, when people needed to replace or repair damaged property, leading to heightened spending.

4. What are the five most frequently purchased product families?

Top 5 families by total sales

The top five families by total sales give a valuable understanding of customer preferences and enable us to optimize product offerings in these top-performing categories.

5. Is there a relationship between transactions and sales?

Correlation value

The correlation coefficient between transactions and sales is approximately 0.215. This indicates a positive correlation between these two variables. A positive correlation means that as the number of transactions increases, the sales tend to increase as well. However, the correlation coefficient of 0.215 suggests that the relationship between transactions and sales is not very strong.

6. Is there any association between oil prices and sales?

Correlation value

The correlation coefficient between sales and oil prices is approximately -0.079. This value indicates a weak negative correlation between the two variables. In other words, as oil prices increase or decrease, there is a very slight tendency for sales to decrease or increase respectively. However, the correlation is quite close to zero, which suggests that there is no strong linear relationship between sales and oil prices in the given data.

After providing responses to the questions, we proceeded to remove the unnecessary columns from the dataset. These columns included ‘family,’ ‘id,’ ‘locale_name,’ ‘store_nbr,’ and ‘description.’ We then checked for the missing values in the dataset.

Missing values in the data

In the case of transactions, we replaced the ‘nan’ values with the value zero, since it indicated that no transactions were recorded on those specific dates. For the holiday column, we replaced all the missing rows with non-holiday. Similarly, for the transferred column, we filled in the missing rows with “false” since the ‘nan’ signified that they were not transferred. As for the “dcoilwtico” column, we employed a backfill approach to address the missing values. However, for the sales and onpromotion columns, we removed the rows with missing values.

Overview of the merged data

Exploring the sales column

Here, we examined the sales trend.

Trends in the sales data

From this plot, we noticed a decrease in sales at the start of each year.

Stationarity Test

The stationarity test checks whether a time series data shows consistent statistical properties over time. It is important to check for stationarity in time series data because non-stationary data shows fluctuations, trends, or seasonality, making it harder to model and predict. Data is said to be stationary when the mean, variance, and autocorrelation are constant over time.

Null hypothesis: The series is not stationary.

Alternate hypothesis: The series is stationary.

We used the Augmented Dickey-Fuller Test to test for the stationarity of the series. The Augmented Dickey-Fuller Test is a statistical test used to determine whether a given time series is stationary or non-stationary.
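A sketch of the ADF test with statsmodels, assuming `sales_series` holds the daily sales as a pandas Series:

```python
from statsmodels.tsa.stattools import adfuller

# The ADF null hypothesis is that the series has a unit root (not stationary).
adf_stat, p_value, *_ = adfuller(sales_series.dropna())
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: the series is stationary.")
else:
    print("Fail to reject the null: the series is not stationary.")
```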

Stationarity test result

From this result, we concluded that the series was not stationary.

To achieve stationarity in our time series, there are various methods one can use. However, we opted for a two-step approach. First, we decomposed the series, and then we applied the Box-Cox transformation to maintain constant variance.

Decomposing the series

Decomposing data is a valuable step in time series analysis because it helps identify and separate components that might make the data non-stationary, such as trends and seasonality. It also prepares data for modeling.
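A sketch of the decomposition with statsmodels; the yearly period of 365 on daily data is an assumption, as the article does not state its exact settings:

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Split the series into trend + seasonal + residual (additive model).
decomposition = seasonal_decompose(sales_series, model="additive", period=365)
decomposition.plot()
plt.show()
```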

Decomposed Data

From the result above, we observed a rising sales trend over the period of time. As for the seasonality, there was a consistent decrease in sales at the beginning of each year. The residuals represent the noise or irregular fluctuations in the data after the trend and seasonality have been removed. It also exhibits more consistent variance.

Box-Cox Transformation

The Box-Cox transformation is a mathematical technique that helps deal with unequal variance or non-normality in data.
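A sketch with scipy; shifting the series by 1 before transforming is a hypothetical preprocessing choice to handle zero-sales days, since Box-Cox requires strictly positive values:

```python
from scipy import stats

# Box-Cox needs strictly positive inputs, so shift zero-sales days first
# (an assumption, not stated in the article).
transformed, lam = stats.boxcox(sales_series + 1)
print(f"Estimated Box-Cox lambda: {lam:.3f}")
```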

Before Box-Cox Transformation
After Box-Cox Transformation

After the transformation, we saw that the spread of the data (variance) looked more consistent across the data range.

We then proceeded to check whether the data was stationary using the ADF test.

Stationarity Test Result

Our data is now stationary.

Correlation Charts

Autocorrelation and partial autocorrelation are important concepts in time series analysis. Autocorrelation measures the extent to which lagged (previous) values are related to the current value. When autocorrelation is high, it suggests that past values have a strong influence on current values, indicating a trend in the data, and vice versa. Partial autocorrelation measures the relationship between a value and a given lag after removing the influence of the intermediate lags.
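A sketch of both plots with statsmodels, assuming `stationary_sales` is the transformed series; 40 lags is an illustrative choice:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(stationary_sales, lags=40, ax=axes[0])   # raw lag correlations
plot_pacf(stationary_sales, lags=40, ax=axes[1])  # intermediate lags removed
plt.tight_layout()
plt.show()
```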

ACF and PACF Plot

From the ACF plot, we saw that all the lag values were situated above the significance threshold. This indicates a strong autocorrelation pattern in the data; that is, past values have a strong influence on the current value. The significant peaks at lags 7, 14, 21, and 27 in the ACF plot also indicate a strong seasonal component in the data.

We proceeded to divide the dataset into two parts, 80% for training and 20% for testing, and then fitted our autoregressive (AR) model on the training set.

Univariate Modeling Using Statistical Time Series Models

Statistical time series methods can be ill-suited to large datasets because of their computational limitations, so we restricted this part of the analysis to our target variable, sales. For statistical modeling, we used the training dataset with every column removed except ‘dates’ and ‘sales’.

with AutoReg model

An autoregressive model is a statistical model that predicts future values based on past values.
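A sketch of the fit, using a chronological 80/20 split and the 10 lags reported in the model summary below; `stationary_sales` is assumed to be the transformed series:

```python
from statsmodels.tsa.ar_model import AutoReg

# Chronological split: the first 80% trains the model, the last 20% tests it.
split = int(len(stationary_sales) * 0.8)
train_ts, test_ts = stationary_sales[:split], stationary_sales[split:]

# AutoReg(10): predict each value from its previous 10 observations.
ar_results = AutoReg(train_ts, lags=10).fit()
ar_preds = ar_results.predict(start=len(train_ts),
                              end=len(train_ts) + len(test_ts) - 1)
print(ar_results.summary())
```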

AR Model Result

The dependent variable tells us that the model is applied to the “sales” variable, the target being predicted. The dataset used for modeling contains 1,347 observations. The model AutoReg(10) indicates that it is an autoregressive model with 10 lagged values. The coefficients were estimated by conditional maximum likelihood. The AIC and BIC measure the model’s goodness of fit while penalizing complexity; the lower the value, the better the fit.

The coefficients represent the estimated parameters of the autoregressive model. A positive coefficient indicates a positive relationship: if sales were higher at lag 1, the current sales value is likely to be higher as well. A negative coefficient suggests an inverse relationship: if sales were higher at lag 2, the current sales value is expected to be lower.

The roots of the autoregressive model indicate whether a model is stable or not.

After fitting the AR model on our train data, we went ahead to predict the sales of our 20% test data.

AR Model Forecast

We then assessed our model’s performance by utilizing metrics such as MSE, RMSE, MSLE, and RMSLE.

Mean Squared Error (MSE): It is a statistical measure of how close a model’s predictions are to the actual values. It is calculated by taking the average of the squared errors between the predicted and actual values. The lower the MSE, the better the model is at predicting the actual values.

Root Mean Squared Error (RMSE): It is calculated by taking the square root of the Mean Squared Error (MSE).

Mean Squared Logarithmic Error (MSLE): It is a statistical measure of how close a model’s predictions are to the actual values while taking the scale of the values into account. MSLE is calculated by taking the average of the squared logarithmic errors between the predicted and actual values. It is a commonly used metric for evaluating regression models, especially when the target spans several orders of magnitude, as sales data often does. The lower the MSLE, the better the model is at predicting the actual values.

Root Mean Squared Logarithmic Error (RMSLE): It is calculated by taking the square root of the Mean Squared Logarithmic Error (MSLE).
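A sketch of the evaluation with scikit-learn, reusing `ar_preds` and `test_ts` from the AR sketch above; clipping negative forecasts to zero is an assumption, since MSLE requires non-negative inputs:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# MSLE/RMSLE are undefined for negative values, so clip forecasts at zero.
y_pred = np.clip(ar_preds, 0, None)
mse = mean_squared_error(test_ts, y_pred)
msle = mean_squared_log_error(test_ts, y_pred)
print(f"MSE={mse:.2f}  RMSE={np.sqrt(mse):.2f}  "
      f"MSLE={msle:.4f}  RMSLE={np.sqrt(msle):.4f}")
```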

Based on the results above, the AR model had high MSE and RMSE values but low MSLE and RMSLE values. Despite this, our primary focus for assessing the model’s performance was the RMSLE metric, on which the model scored 0.42.

with ARIMA model

The Autoregressive Integrated Moving Average(ARIMA) model is a statistical model used to analyze and forecast time series data. ARIMA models are based on the assumption that the future values of a time series variable can be predicted based on its past values and errors.

ARIMA Best Model

We used the auto_arima function to perform a stepwise search for the best ARIMA model for our sales data. This function tries different combinations of ARIMA parameters (p, d, q) to find the model that minimizes a specified criterion, such as the AIC. Overall, the output shows that ARIMA(5,1,5)(0,0,0)[0] is the best model, as it has the lowest AIC. However, the total fit time shows that the model is complex and takes a long time to train.
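A sketch of the stepwise search with pmdarima, reusing `train_ts` from the AR sketch; the search bounds here are illustrative rather than the article’s exact settings:

```python
import pmdarima as pm

# Stepwise search over (p, d, q), ranked by AIC; trace=True prints candidates.
auto_model = pm.auto_arima(
    train_ts,
    start_p=0, start_q=0, max_p=5, max_q=5,
    d=None,          # let the test-based procedure pick the differencing order
    seasonal=False,  # the selected ARIMA(5,1,5)(0,0,0)[0] has no seasonal terms
    stepwise=True,
    trace=True,
)
print(auto_model.summary())
```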

ARIMA Model Forecast

We proceeded to assess our model’s performance by employing the MSE, RMSE, MSLE, and RMSLE metrics.

with SARIMA Model

Seasonal Autoregressive Integrated Moving Average (SARIMA) is a statistical model used to analyze and forecast time series data with a seasonal component. SARIMA models are based on the ARIMA model, but they include additional parameters to account for the seasonality of the data.
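A sketch with statsmodels’ SARIMAX; the weekly seasonal period m=7 follows from the ACF peaks at lags 7, 14, and 21, while the specific orders here are illustrative, not the article’s exact choices:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# (p, d, q) x (P, D, Q, m): the second tuple models the weekly (m=7) seasonality.
sarima_results = SARIMAX(train_ts,
                         order=(1, 1, 1),
                         seasonal_order=(1, 1, 1, 7)).fit(disp=False)
sarima_forecast = sarima_results.forecast(steps=len(test_ts))
```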

SARIMA Model Forecast

We evaluated our model using the MSE, RMSE, MSLE, and RMSLE metrics.

We proceeded to combine the results of our three statistical models.

Combined result

After comparing our three statistical models, SARIMA performed the best by achieving the lowest RMSLE value of 0.29.

Feature Selection

We used the Granger causality test to select the features that have an impact on sales. The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another. It is based on the idea that if one time series, X, Granger-causes another time series, Y, then past values of X should be useful in predicting future values of Y. The test uses a regression model that predicts future values of Y from past values of Y and past values of X. The null hypothesis posits that X does not Granger-cause Y, while the alternative hypothesis posits that it does. If the p-value exceeds the 5% threshold, we fail to reject the null hypothesis, and vice versa. The Granger causality test works only for numeric variables, so we assumed our categorical variables had an impact on sales.

We performed this test on the ‘onpromotion’ and ‘sales’ columns and on the ‘dcoilwtico’ and ‘sales’ columns, considering a range of 30 lags.
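A sketch of one of these tests with statsmodels; `merged` stands in for the combined dataset, and the second column is tested as a predictor of the first:

```python
from statsmodels.tsa.stattools import grangercausalitytests

# Does 'onpromotion' Granger-cause 'sales'? Tested at every lag up to 30.
results = grangercausalitytests(merged[["sales", "onpromotion"]].dropna(),
                                maxlag=30, verbose=False)
for lag, res in results.items():
    p_value = res[0]["ssr_ftest"][1]
    print(f"lag {lag:2d}: p-value = {p_value:.4f}")
```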

The p-values for the ‘onpromotion’ and ‘sales’ columns were consistently below the 0.05 threshold over the 30-lag period, which meant we could reject the null hypothesis and conclude that there is Granger causality between the two variables. In other words, onpromotion significantly influences sales.

However, our analysis showed that the dcoilwtico variable did not have a statistically significant impact on sales, as the p-values for lags 20 and beyond all exceeded the 5% threshold. As a result, we removed the dcoilwtico variable from our analysis.

Feature Processing

In this section, we encoded our categorical columns and generated new features for our train and test datasets. These features included year, month, day, day of the week, day of the year, week of the year, quarter, weekend, and day of the month.
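A sketch of the date-derived features with pandas; the column names are assumptions that mirror the list above:

```python
import pandas as pd

def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar features from a 'date' column (names are illustrative)."""
    df = df.copy()
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    df["day_of_month"] = df["date"].dt.day
    df["day_of_week"] = df["date"].dt.dayofweek
    df["day_of_year"] = df["date"].dt.dayofyear
    df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
    df["quarter"] = df["date"].dt.quarter
    df["is_weekend"] = (df["date"].dt.dayofweek >= 5).astype(int)
    return df
```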

Train data with new features

Subsequently, we proceeded with our modeling using machine learning models.

Multivariate Modeling with Machine Learning Models

with linear regression

with decision trees

with random forest regressor
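A combined sketch of the three fits on the engineered features; `X_train`, `X_test`, `y_train`, and `y_test` stand in for the encoded feature matrices and sales targets, and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = np.clip(model.predict(X_test), 0, None)  # RMSLE needs non-negative values
    rmsle = np.sqrt(mean_squared_log_error(y_test, preds))
    print(f"{name}: RMSLE = {rmsle:.2f}")
```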

We displayed our merged result afterward.

Combined Results

After comparing our univariate models to our multivariate models, linear regression had the lowest RMSLE of 0.26, making it the best model of all.

Enhancing Model Performance

In this section, we delved into techniques aimed at improving the model’s performance by selecting relevant features and assessing their importance.

with linear regression

Feature Importance In Linear Regression Model

Based on the information presented in the graph, we decided to remove the “is_weekend” and “day_of_month” columns from our model as they exhibited the lowest levels of importance.

with decision tree

Feature Importance In Decision Tree Model

Here, we dropped the transferred column, and we did the same for the random forest regressor. We then assessed the performance of each of the models.

Model Evaluations

Conclusion

After comparing all of our models, we concluded that the linear regression model, which exhibited the lowest RMSLE, was the best-performing model. We then used our best model to predict the sales in our test dataset. Afterward, we combined our model predictions with the sales data and removed some of the features to structure it in a manner that resembled the sample_submission.csv file.

Reference

Here is a link to my GitHub repository to view the notebook.

Appreciation

I highly recommend Azubi Africa for its comprehensive and effective programs. Read more articles about Azubi Africa here, and take a few minutes to visit this link to learn more about Azubi Africa’s life-changing programs.


Thanks for reading!!!
