E-commerce Startup Sales Forecasting
How to Create an Optimal Forecast while Minimizing Lost Sales with Limited Data
Written by : Jennifer Siwu
Introduction
Forecasting sales for e-commerce startup businesses comes with many challenges, including limited historical data, competition from retail stores and other e-commerce businesses, and extreme seasonality due to Peak Season (I will explain this term later in this post).
In this post, I aim to share how I addressed some of those challenges and identified the best approach to forecasting sales for a common e-commerce startup business model. To visualize the results for business stakeholders, I created a basic Shiny web application at: jsiwu94.shinyapps.io/EcommerceSalesForecasting.
The final forecast generated through this analysis was able to reduce estimated annual lost sales by ~$1M.
If you are interested in learning more, all the code is available in my GitHub repository.
Dataset
The dataset came from a Brazilian e-commerce startup called Olist and was made available through Kaggle.com. It came in the form of a SQL database with many tables containing information spanning customers → orders → sellers. For this analysis, I only used the order and product information obtained by joining the 3 tables highlighted below.
There were 73 product categories and ~32K product IDs in the dataset. The order purchase dates range from Feb 1st, 2017 to July 31st, 2018. After combining the 3 tables, I created a master data file that looks like the one below.
Exploratory Data Analysis
1. Sales and Revenue Pareto by Product Category
Amongst all 73 product categories, only 15–17 accounted for the top 80% of sales volume and revenue. Based on that, I will focus this analysis on the product categories within that top 80% of sales (~$11M in average annual revenue).
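As a rough sketch of how such a Pareto cut can be computed with dplyr (master_df, product_category, sales_volume, and revenue are assumed names here, not necessarily the exact ones in my master data file):
library(dplyr)
# Rank categories by sales volume and keep the ones covering the top 80%
top_categories <- master_df %>%
  group_by(product_category) %>%
  summarise(sales = sum(sales_volume), revenue = sum(revenue)) %>%
  arrange(desc(sales)) %>%
  mutate(cum_sales_pct = cumsum(sales) / sum(sales)) %>%
  filter(lag(cum_sales_pct, default = 0) < 0.80)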
2. Trend and Seasonality
The historical sales data showed an upward trend with an extreme jump on November 24th, 2017, which was Black Friday. The fact that sales that day were 3x regular daily sales, and that this happened on only one day of the year, makes it very difficult to forecast.
This phenomenon, commonly referred to as “Peak Season”, is very common for retail and e-commerce businesses. The Peak Season period usually lasts from Halloween until a few weeks after New Year's, with many days within that timeframe showing very extreme sales.
3. Seasonality and Trend by Product Category
Let’s zoom in on the time series a little to see the Black Friday jump more closely for each product category. As shown below, each of the top 3 product categories (by sales volume) has a different seasonality and peak effect.
Based on this information, I will build a bottom-up forecast, creating a separate model for each product category and summing them up for the overall forecast. Then, I will allocate each category-level forecast to individual items based on their % of sales within the category over the last 3 months, since products within the same category more or less follow the same seasonality and trend. A sketch of this allocation step follows.
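A minimal sketch of the item-level allocation, assuming two hypothetical objects: recent_sales (item-level sales for the last 3 months) and category_forecast (one forecasted volume per category):
library(dplyr)
# Share of each product id within its category over the last 3 months
item_shares <- recent_sales %>%
  group_by(product_category, product_id) %>%
  summarise(item_sales = sum(sales_volume), .groups = "drop_last") %>%
  mutate(sales_share = item_sales / sum(item_sales)) %>%
  ungroup()
# Spread the category-level forecast down to items by their sales share
item_forecast <- category_forecast %>%
  inner_join(item_shares, by = "product_category") %>%
  mutate(item_forecast = forecast_volume * sales_share)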
4. Annual, Monthly, and Weekly Seasonalities
For the yearly seasonality, I only used the first year's data since we did not have a full second year. Based on the plots below, we can clearly see strong seasonality within weeks and months. I will use this information to tune my model parameters later.
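For reference, seasonal plots like these can be generated with the feasts package; daily_tsb is an assumed name for the daily sales tsibble used in the modeling sections below.
library(feasts)
# Day-of-week and day-of-month seasonal plots of daily sales
daily_tsb %>% gg_season(sales_volume, period = "week")
daily_tsb %>% gg_season(sales_volume, period = "month")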
Approach
Before building the forecasting model, I first need to identify the goal and metrics that are appropriate for Olist’s business model. Let’s take a look at the considerations below:
- As an E-commerce startup, the main business concern is to maximize growth and revenue.
- Additionally, Olist does not have any physical warehouses as the items are shipped directly from sellers to customers. Therefore, Inventory Cost is not a concern in this case.
- Lastly, the main competitive advantages of an e-commerce business over retail stores are the availability and variety of the products it offers.
With all of this in mind, my goal for this forecast is to minimize lost sales due to under-forecasting (forecast < actual). Therefore, the metric I will use to evaluate the final forecast performance is the estimated annual lost sales in dollars. Additionally, I will use AIC and RMSE to compare the models.
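Since lost sales are not a standard accuracy metric, here is a minimal sketch of how I think of it: whenever the forecast falls below the actual, the shortfall in units, valued at the average selling price, counts as lost revenue. Column names (actual_volume, forecast_volume, avg_price) are assumptions for illustration.
library(dplyr)
# Estimated lost sales: under-forecasted units valued at average price
estimate_lost_sales <- function(df) {
  df %>%
    mutate(shortfall_units = pmax(actual_volume - forecast_volume, 0)) %>%
    summarise(lost_sales_usd = sum(shortfall_units * avg_price))
}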
TRAIN & TEST SPLIT
To evaluate the models later, I split my data into train and test samples with May 1st, 2018 as the cutoff date (train data prior to the cutoff date, test data after it).
Since this is a time series, the split cannot be done randomly, and the holdout (test) window should avoid irregular periods.
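A minimal sketch of this chronological split, assuming alltsb is the full daily tsibble and order_date is its date index (an assumed column name):
library(dplyr)
# Chronological split: train before the cutoff, test on and after it
cutoff <- as.Date("2018-05-01")
traintsb <- alltsb %>% filter(order_date < cutoff)
testtsb  <- alltsb %>% filter(order_date >= cutoff)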
SARIMA
Differencing to Stationarity
In order to fit an ARIMA model, we need to make sure the data is stationary (constant mean and variance over time). Once I differenced the data at a lag of one week, the time series became stationary, as shown below and confirmed by the Augmented Dickey-Fuller test.
library(ggfortify)   # autoplot() for ts objects
library(aTSA)        # adf.test() with the nlag argument
diff1 <- diff(tsdiffts, lag = 7)        # lag-7 difference removes the weekly pattern
autoplot(diff1, ts.colour = "#213ee2")
adf.test(diff1, nlag = 1)               # Augmented Dickey-Fuller test for stationarity
Analyzing ACF and PACF
Below is a comparison of the autocorrelation and partial autocorrelation before and after differencing. The ACF and PACF plots before differencing show many correlated lags. Using the ACF and PACF plots produced after differencing, however, we can identify the parameter orders for our SARIMA model.
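For reference, these plots can be reproduced with the forecast package, assuming tsdiffts is the original series and diff1 is the weekly-differenced series from above:
library(forecast)
# ACF and PACF before differencing
ggAcf(tsdiffts); ggPacf(tsdiffts)
# ACF and PACF after the weekly differencing
ggAcf(diff1); ggPacf(diff1)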
Training SARIMA Model
Based on the PACF and ACF plots above, I specified an ARIMA of order (4,1,2) with seasonal order (0,1,2) and a seasonal period of 7 days, based on the learnings from the Exploratory Data Analysis section earlier.
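A minimal sketch of this specification in the fable package, assuming traintsb is the training tsibble (the exact call I ran may differ slightly):
library(fable)   # also attaches fabletools for model() and report()
# SARIMA(4,1,2)(0,1,2)[7] on daily sales volume
fit.sarima <- traintsb %>%
  model(SARIMA = ARIMA(sales_volume ~ 0 + pdq(4, 1, 2) + PDQ(0, 1, 2, period = 7)))
fit.sarima %>% report()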
Model Result
The SARIMA model’s residual analysis, specifically the residual ACF plot, showed no significant remaining autocorrelation, indicating the model was reasonable and unbiased. However, the residual density curve was skewed due to the extreme sales on Black Friday. I did not treat this data point as an outlier because we do want that jump to be reflected in the next peak season forecast.
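A residual check along these lines can be produced with the feasts package, assuming the fit.sarima object from the sketch above:
library(feasts)
# Time plot, residual ACF, and histogram for the fitted SARIMA
fit.sarima %>% gg_tsresiduals()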
TSLM
Training TSLM Model
In addition to SARIMA, I built a multivariate regression model to compare performance. To capture the seasonality better, I created dummy variables for months and weeks. Furthermore, I added Fourier terms within the month.
library(fable)   # attaches fabletools for model(), forecast(), accuracy()
fit.tslm <- traintsb %>%
  model(TSLM(sales_volume ~ month + trend() + season(period = "week") + fourier(period = "month", 3)))
fit.tslm %>% report()
fc.reg <- fit.tslm %>% forecast(new_data = testtsb)
fc.reg %>% accuracy(testtsb)
Model Result
While this model yielded a much lower AIC than the SARIMA above, it still did not quite capture the correct trend and seasonality on the test dataset, as shown below.
SARIMAX
Training SARIMAX Model
With SARIMAX, I aimed to combine the additional seasonal regressors used in the multivariate regression with the SARIMA model. Additionally, I incorporated external data, the Brazilian national holidays, as a regressor, while adding Fourier terms for week, month, and year to better capture the seasonality.
# SARIMAX: ARIMA errors with calendar, holiday, trend, and Fourier regressors
fit.arlm <- traintsb %>%
  model(SARIMAX = ARIMA(sales_volume ~ month + weekday +
                          holiday + day +
                          trend() +
                          fourier(period = "week", 1) +
                          fourier(period = "month", 3) +
                          fourier(period = "year", 5)))
fit.arlm %>% report()
fc.arlm <- fit.arlm %>% forecast(new_data = testtsb)
fc.arlm %>% accuracy(testtsb)
Model Result
This model was a lot more complex than all the previous iterations, and yet it still could not capture the seasonality curve in the test dataset.
SARIMA, SARIMAX, ETS, and TSLM
Comparing all the classical time series models described above, with the addition of ETS, the top 2 models based on their test-data performance were ETS and SARIMA.
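A sketch of how such a side-by-side comparison can be done with fabletools, with the specifications abbreviated from the sections above:
library(fable)
# Fit the candidates on the training window and score them on the test window
fits <- traintsb %>%
  model(
    SARIMA  = ARIMA(sales_volume ~ 0 + pdq(4, 1, 2) + PDQ(0, 1, 2, period = 7)),
    SARIMAX = ARIMA(sales_volume ~ holiday + trend() + fourier(period = "week", 1)),
    TSLM    = TSLM(sales_volume ~ trend() + season(period = "week")),
    ETS     = ETS(sales_volume)
  )
fits %>% glance()                                              # in-sample AIC and friends
fits %>% forecast(new_data = testtsb) %>% accuracy(testtsb)    # test-set RMSE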
ENSEMBLE TS MODEL
Since SARIMA’s trend ran a little too high for this dataset, I decided to create an ensemble model averaging the ETS and ARIMA predictions. This was my preferred model among all the previous iterations.
# Ensemble: simple average of the SARIMA and ETS models
fit_ensemble <- traintsb %>%
  model(ensemble = (ARIMA(sales_volume ~ 0 + pdq(4, 1, 2) + PDQ(0, 1, 1, period = 7)) + ETS(sales_volume)) / 2)
fc_ensemble <- fit_ensemble %>% forecast(new_data = testtsb)
fcst_ensemble <- fit_ensemble %>% forecast(new_data = alltsb)
fc_ensemble %>% accuracy(testtsb)
FB PROPHET
Having tried multiple classical time series forecasting models and learned that none of them could quite capture the trend and seasonality in this data, I turned to the Facebook Prophet algorithm.
An automatic Prophet model without any parameter tuning yielded results similar to the previous time series models. Therefore, my goal here was to find the optimal parameters that produce the lowest test RMSE.
Parameter Tuning
To tune the model, I created custom seasonalities; the parameters tuned included the changepoint scale, the prior scales, and the Fourier orders. I also set the growth to “linear” for my Prophet models, as growth is relatively linear for most startup companies, and it takes a while before a startup reaches maturity and starts flattening out. A simplified sketch of the search follows.
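A simplified sketch of the kind of grid search I have in mind, assuming train_df and test_df are data frames with Prophet's ds/y columns and hldydf holds the holiday table; the real search also covered the seasonality Fourier orders.
library(prophet)
# Grid search over changepoint and seasonality prior scales,
# scored by RMSE on the held-out test window
grid <- expand.grid(cp_scale = c(0.01, 0.04, 0.1),
                    seas_scale = c(10, 20, 30))
grid$rmse <- apply(grid, 1, function(p) {
  m <- prophet(train_df,
               growth = "linear",
               holidays = hldydf,
               seasonality.mode = "multiplicative",
               changepoint.prior.scale = p[["cp_scale"]],
               seasonality.prior.scale = p[["seas_scale"]])
  fc <- predict(m, test_df)
  sqrt(mean((test_df$y - fc$yhat)^2))
})
grid[which.min(grid$rmse), ]   # best parameter combination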
Model Result
The optimal parameters found by minimizing the test RMSE were as follows.
library(prophet)
# hldydf: data frame of Brazilian national holidays (columns holiday, ds)
m <- prophet(growth = "linear",
             holidays = hldydf,
             daily.seasonality = FALSE,
             yearly.seasonality = FALSE,
             weekly.seasonality = FALSE,
             seasonality.mode = "multiplicative",
             changepoint.prior.scale = 0.04,
             seasonality.prior.scale = 20,
             holidays.prior.scale = 35) %>%
  add_seasonality(name = "monthly",
                  period = 30.5,
                  fourier.order = 45) %>%
  add_seasonality(name = "weekly",
                  period = 7,
                  fourier.order = 20) %>%
  add_seasonality(name = "yearly",
                  period = 365.25,
                  fourier.order = 20) %>%
  add_seasonality(name = "quarter",
                  period = 365.25/4,
                  fourier.order = 5,
                  prior.scale = 15)
# The model is then fit on the training data with fit.prophet()
As shown on the graph, this model was able to correctly capture the trend and seasonality in the dataset. Its test RMSE was also lower than that of the preferred classical time series model, the ensemble of ETS and ARIMA. Based on this, I chose this model as the best model and used it to forecast Olist's future sales.
Final Result
With the tuned Prophet model, I was able to correctly capture the trend and seasonality in the data while reducing estimated lost sales by over $1M compared with the ensemble model based on ETS and ARIMA.
Demo
To present the results to stakeholders in an actionable format, I created a basic Shiny web app. The app visualizes forecast-versus-actual comparisons for each product category and shows the forecasted numbers in a table that can easily be pasted into Excel (link here).
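As a rough sketch of the app's structure (not the actual app code), assuming a hypothetical fcst_df with product_category, order_date, value, and a type column distinguishing forecast from actual:
library(shiny)
# Minimal skeleton: pick a category, see the forecast plot and table
ui <- fluidPage(
  selectInput("category", "Product category", choices = unique(fcst_df$product_category)),
  plotOutput("forecast_plot"),
  tableOutput("forecast_table")
)
server <- function(input, output) {
  selected <- reactive(dplyr::filter(fcst_df, product_category == input$category))
  output$forecast_plot <- renderPlot(
    ggplot2::ggplot(selected(), ggplot2::aes(order_date, value, colour = type)) +
      ggplot2::geom_line()
  )
  output$forecast_table <- renderTable(selected())
}
shinyApp(ui, server)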
Key Takeaways
Forecasting with limited historical data and extreme seasonality is challenging, but with a proper assessment and understanding of the business model, it can still help the company minimize losses and maximize gains. In real business applications, these forecast numbers are still “raw” in that they do not consider any future promotional plans or price changes. Therefore, once analysts are done with these forecasts, they usually meet with the finance and marketing teams to discuss any external factors that weren't considered in the forecast and re-adjust it accordingly.
References
https://r4ds.had.co.nz/many-models.html
https://towardsdatascience.com/implementing-facebook-prophet-efficiently-c241305405a3
https://facebook.github.io/prophet/
https://online.stat.psu.edu/stat510/lesson/3