Time series analysis for demand forecasting

Raffaela Loffredo
45 min read · Oct 10, 2023


Click here to read this article in Portuguese.

The analysis of time series aims at identifying patterns, forecasting trends, and making decisions, which can be applied in various fields as it aids in understanding the behavior of something over time.

In this article, I demonstrate the path I followed to find the best predictive model for wine demand, comparing models built with the Naive approach, Moving Average, Holt’s Linear Trend Model, ARIMA, Prophet, and PyCaret.

* Note

This is the complete study report, including the code and methodology used. I also published a shorter, more direct version that presents only the main results of this research.

To check the summarized article click here.

Summary

  1. About the Project
    1.1. Demand Forecasting
    1.2. Time Series
    1.3. Business Understanding
  2. General Objective
    2.1. Specific Objectives
  3. Obtaining the Data
  4. Variable Dictionary
  5. Data and Library Importation
  6. Exploratory Data Analysis
    6.1. Why use Pandas Profiling?
    6.2. Generate Report
    6.2.1. Products Report
    6.2.2. Sales Report
  7. Data Preprocessing
    7.1. Products Dataset
    7.2. Sales Dataset
    7.3. Joining Datasets
  8. Data Analysis
    8.1. Remove dates from 2016 and 2017
    8.2. Most expensive and cheapest wine
  9. Feature Engineering
  10. Business Analysis
    10.1. Wines: best sellers x highest revenue
    10.2. Producers: median price x quantity of products
    10.3. Producers: best sellers x highest revenue
    10.4. Region: quantity of product x revenue
  11. Demand Forecast
    11.1. Data Preparation
    11.2. Stationary Series
    11.2.1. Augmented Dickey-Fuller (ADF) Test
    11.2.2. Transform Series
    11.3. Split the set into training and validation data
    11.4. Make predictions
    11.4.1. Naive Approach
    11.4.2. Moving average
    11.4.3. Holt’s Linear Trend Model
    11.4.4. ARIMA
    11.4.5. Prophet
    11.4.6. PyCaret
  12. Model Assessment
    12.1. Reverse transformations for stationary series
    12.2. Charts
    12.3. Assessment Metrics
  13. Conclusion

1. About the Project

1.1. Demand Forecasting

Being able to predict demand for a product or service is a strategic technique that can assist in decision-making, planning, inventory management, resource optimization, and even customer satisfaction. There are many applications where this type of tool can be applied in companies from various sectors.

This study is based on the requirements specified in a real job posting on LinkedIn. The contract called for a sales forecasting solution to assist the company with inventory management, and it noted that the particular challenges of the wine industry should be taken into consideration.

Every demand forecasting project requires the analysis of historical data, statistical modeling, and the use of machine learning algorithms. Accordingly, the company stated that it would provide data from past sales.

1.2. Time Series

This type of problem is typically solved using time series analysis, which involves data distributed at sequential and regular intervals over time. This means that data is collected at time intervals such as hourly, daily, monthly, etc. Additionally, each new data point depends on the previous data points. If it weren’t for the connection between the data and the time variable, the problem could be solved with linear regression (Analytics Vidhya, 2018).

1.3. Business Understanding

Every project requires an understanding of the specific case under study. Understanding its peculiarities helps the data scientist in choosing the best parameters and machine learning algorithms to construct a predictive model that fits the specific case.

In this case, we are dealing with a core product: wines. They are considered living beverages, meaning they are in constant evolution. For this reason, they require ideal conditions to maintain their characteristics and not spoil.

Among the necessary precautions are storage and transportation to prevent inventory loss, making the goal of this project even more critical. These precautions can make the difference between the company making a profit or incurring losses in the long term.

Sunlight must be avoided, and wines need to be kept in an environment with controlled and constant humidity and temperature. It is more important that the temperature remains constant than what the specific temperature is. Typically, the target temperature is between 10 and 15 degrees Celsius, varying depending on the type of wine.

It’s also important to avoid movements such as shaking, vibrations, or any kind of disturbance to the liquid. This indicates that inventory management needs to be done optimally to minimize any unnecessary handling of the wines. If possible, it’s recommended to keep them in a horizontal position, which may also need to be considered when planning the storage where the wines will be kept.

For these reasons, optimizing turnover is necessary so that the wine spends the least amount of time possible in a position that could deteriorate it, ensuring that it is delivered to customers in an excellent consumable condition.

Although some wines improve with age, the vast majority are intended for “immediate” consumption, from the moment they arrive on the market, up to a maximum of 1 to 3 years. Therefore, it’s vital not to keep wines in stock for a longer period than that.

2. General Objective

Develop an algorithm based on machine learning to predict demand for wines in a store specializing in this product.

2.1. Specific Objectives

  • Perform an exploratory data analysis to understand the dataset and check for the presence of abnormalities that require appropriate treatments.
  • Create machine learning models using different algorithms.
  • Evaluate models to find the one with the best performance.

3. Obtaining the Data

The data used in this project was provided by Rafael Duarte, a data scientist and wine enthusiast, who created this dataset due to the difficulty of finding sales data in this sector. He points out that:

  • The wine list, that is, the products, is real and based on the actual offerings of an e-commerce wine store in Brazil. Names, vintages, and prices are 100% real, converted from Brazilian reais to US dollars for broader international appeal.
  • The sales dataset is based on a Kaggle competition. It originally consisted of 5 years of daily sales data, distributed across 10 stores, with a catalog of 50 products. This dataset was modified to include 3 years of daily sales data, distributed across 3 stores, and with 219 different products in stock.
  • The data is divided into two files: one with historical sales data and the other with information about the wines. I saved copies of both in the cloud to guard against unavailability or future changes, and these copies are the files used in the code to preserve the integrity of the study.

It is worth noting that, since this dataset contains synthetic data, even though it is based on real data, we may encounter problems that would not be found with real data. This should be a point of attention throughout the analysis.

4. Variable Dictionary

Understanding the dataset involves checking the available variables so that a proper analysis can be conducted. Here is a summary of the attributes found and their respective meanings. Since we have two datasets, produtos.csv and vendas.csv, their variables are listed in two separate tables.

produtos.csv
in alphabetic order

country: wine’s country of origin
item_id: item identification number
kind: type of wine: sparkling, rose sparkling, white, rosé, and red
name: wine’s name
price_brl: price in Brazilian reais
price_usd: price in US dollars
producer: wine’s producer
region: wine production region
vintage: harvest year

The sales set also has an item identifier through which you can return to the products set and obtain more details about the item purchased by the customer.

vendas.csv
in alphabetic order

date: date of sale
item: item identification number
store: the number to identify the store
sales: number of units sold

5. Data and Library Importation

When starting a project, we need to install packages, import the libraries whose functions will be used in the following lines of code, and make the necessary configurations for the code’s output. We also import the datasets, saving each one in a variable for later use.

# install additional packages
!pip install pycaret -q # auto machine learning
!pip install pandas_profiling -q # production of exploratory analysis report
!pip install pmdarima -q # auto arima
# import libraries
import pandas as pd # data manipulation
import numpy as np # array manipulation
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization
import datetime as dt # manipulation of dates and times
from pandas_profiling import ProfileReport # production of exploratory analysis report
from statsmodels.tsa.stattools import adfuller # adf test (stationary series)
from sklearn.metrics import mean_absolute_error as mae # evaluation metric
from sklearn.metrics import mean_absolute_percentage_error as mape # evaluation metric
from statsmodels.tsa.seasonal import seasonal_decompose # decompose time series
from statsmodels.tsa.holtwinters import Holt # Holt's Linear Trend Model forecasting method
from pmdarima.arima import auto_arima # autoarima
from prophet import Prophet # time series
from pycaret.time_series import *

import warnings # notifications
warnings.filterwarnings('ignore') # set notifications to be ignored

# additional settings
plt.style.use('ggplot')
sns.set_style('dark')
# import data sets and save them into variables
df_products_raw = pd.read_csv('https://www.dropbox.com/scl/fi/t8wr73843nxvcw949twi3/products.csv?rlkey=qrms57xtev2rzqzpxuqoc9yif&dl=1')
df_sales_raw = pd.read_csv('https://www.dropbox.com/scl/fi/hlrspaogkay44qe45bbf4/sales.csv?rlkey=j9rlkn2r97j3swfpg9irckvgm&dl=1')

6. Exploratory Data Analysis

This is an essential step in data science projects where the goal is to better understand the data by identifying patterns, outliers, potential relationships between variables, and more. In this study, we explored information that was relevant to guide the answers to the objectives mentioned earlier (see General Objective and Specific Objectives).

To do this, I will generate a report that summarizes the data using the ProfileReport library. From this report, if necessary, a more in-depth analysis will be conducted. However, it will already provide us with sufficient information to identify anomalies such as outliers and data imbalances.

6.1. Why use Pandas Profiling?

The Pandas Profiling library offers the ProfileReport function, which automatically generates a comprehensive exploratory data analysis report. It includes the following information:

  • Overview
    Provides the dataset’s record and attribute count, as well as information about missing values and duplicates.
  • Descriptive Statistics
    Calculates statistics such as mean, median, maximum and minimum values, standard deviation, quartiles, and other statistical measures for each variable.
  • Distribution
    For numeric variables, it creates visualizations such as histograms, box plots, and density plots to facilitate a visual understanding of data distribution.
  • Frequency
    Identifies the main unique values and their respective counts for numeric attributes.
  • Correlation
    Generates a correlation matrix to assess the relationships between variables.

Using ProfileReport streamlines the routine of exploratory data analysis, but it doesn’t exempt the data scientist from conducting an in-depth study of the report to understand and interpret the data, as well as to identify any potential issues that require treatment.

6.2. Generate Report

In this step, I will generate the reports and then print out the visualizations so that you can analyze the statistics of the dataset.

# create reports
report_products = ProfileReport(df_products_raw)
report_sales = ProfileReport(df_sales_raw)

6.2.1. Products Report

# view report from 'products' dataset
report_products.to_notebook_iframe()

The report generated with the code above reveals several key insights:

  • The dataset consists of 219 records and 9 attributes.
  • Upon inspecting the first and last entries of the dataset, it appears to be well-filled and free of anomalies.
  • There are no missing data.
  • No duplicate data entries were found.
  • There is one numeric column, and the rest are categorical, which may require further examination since some categorical variables, such as price-related ones, may not be in the correct data type.
  • Four attributes with high cardinality (many distinct values) were detected, which is consistent with the nature of these attributes (name, producer, price_brl, and price_usd).
  • Strong correlations were identified between item_id and producer, country and item_id, region and item_id, vintage and kind, producer and kind.
  • The variables item_id, vintage, price_brl, and price_usd are currently stored as strings but may need to be converted to more appropriate data types.
  • In the country variable, there are a total of 6 different countries, with France representing 70% of the dataset, followed by Italy with 10% and Spain with 8%.
  • In the region variable, Burgundy accounts for 32% of the dataset, followed by Bordeaux with 21%, and Rhone with 6%.
  • In the vintage variable, there are a total of 17 different vintages, with the 2018 vintage representing 30% of the dataset, followed by 2017 with 22%, and both 2015 and 2014, each with 8.7% representation.
  • The vintage attribute also contains the acronym NV which stands for Non-Vintage. It will need to be changed to 0 to convert the entire attribute to an integer type.
  • In the kind variable, red wine represents 60% of the dataset, followed by white wine at 31.5% and sparkling wine at 4.6%.
  • In the producer variable, there are a total of 58 different producers, with Domaine Ponsot being the most prevalent in this dataset at 5.5%, followed by La Chablisienne at 4.6% and Domaine Matrot at 4.1%.

Therefore, the necessary data preprocessing steps involve changing the data types of the variables: item_id to string, and price_brl and price_usd to float. In the vintage variable, before converting it to an integer, the NV values should be replaced with 0 to facilitate the conversion. Changing the data type of item_id helps avoid any hierarchical (ordinal) associations in the machine-learning model.

6.2.2. Sales Report

# view report from 'sales' dataset
report_sales.to_notebook_iframe()

The report generated with the code above reveals several key insights:

  • The dataset consists of 720,071 records and 4 attributes.
  • Upon inspecting the first and last entries of the dataset, it appears to be well-filled and free of anomalies.
  • There are no missing data.
  • No duplicate data entries were found.
  • There are 2 numeric columns and 2 categorical columns, but the data type of the date variable may need to be checked as it likely isn’t in the correct format.
  • The dates are recorded daily but do not follow a consistent pattern and are not in datetime format.
  • Based on the initial and final records, it appears that the registration period in this dataset is from January 1, 2018, to December 31, 2020.
  • Additionally, the dates repeat for each item sold.
  • The store variable is in string format and appears to be balanced.
  • The item variable is in float format, and there are 219 distinct values, which corresponds to the number of items in the product dataset.
  • The sales variable is in float format.

The necessary data preprocessing steps involve converting the item attribute to string, sales to int, and converting the date variable to the datetime data type. As mentioned in the product report, changing the data type of the item attribute can help avoid any hierarchical associations in the machine learning model.

7. Data Preprocessing

In this stage, the necessary procedures will be carried out to proceed with the algorithm modeling, as identified in the previous section.

7.1. Products Dataset

In the products dataset, the following treatments will be performed:

  • Change item_id to string.
  • Change price_brl and price_usd to float.
  • In the vintage variable, replace NV values with 0.
  • Change vintage to int.

We start by making a copy of the dataset to ensure that we have a modified dataset while keeping the original dataset intact.

# make a copy of 'products' dataset
df_products_clean = df_products_raw.copy()
# change 'item_id' to 'str' type
df_products_clean['item_id'] = df_products_clean['item_id'].astype('str')

# check change
type(df_products_clean['item_id'][0])
'''
str
'''
# change 'price_brl' and 'price_usd' to 'float'
#df_products_clean['price_brl'] = df_products_clean['price_brl'].astype('float')
#df_products_clean['price_usd'] = df_products_clean['price_usd'].astype('float')

# check changes
type(df_products_clean['price_brl'][0])
'''
str
'''

The conversion above fails (which is why it is commented out): values containing commas were found, so I will first remove the commas and then convert the variables to float.

# remove the commas (thousands separators)
df_products_clean.replace(',', '', regex=True, inplace=True)

# change 'price_brl' and 'price_usd' to 'float'
df_products_clean['price_brl'] = pd.to_numeric(df_products_clean['price_brl'])
df_products_clean['price_usd'] = pd.to_numeric(df_products_clean['price_usd'])

# check changes
type(df_products_clean['price_brl'][0])
'''
numpy.float64
'''
# convert 'NV' values from 'vintage' attribute to 0
df_products_clean['vintage'].replace('NV', 0, inplace=True)
df_products_clean.head()
# change 'vintage' to 'int' type
df_products_clean['vintage'] = df_products_clean['vintage'].astype('int64')

# check change
type(df_products_clean['vintage'][0])
'''
numpy.int64
'''

7.2. Sales Dataset

In the sales dataset, the following treatments will be performed:

  • Change item to string.
  • Change sales to int.
  • Change date to datetime.

We start by making a copy of the dataset to ensure that we have a modified dataset while keeping the original dataset intact.

# make copy of the 'sales' dataset
df_sales_clean = df_sales_raw.copy()
# change 'item' to 'str' type
df_sales_clean['item'] = df_sales_clean['item'].astype('str')

# check change
type(df_sales_clean['item'][0])
'''
str
'''
# change 'sales' to 'int'
df_sales_clean['sales'] = df_sales_clean['sales'].astype('int64')

# check change
type(df_sales_clean['sales'][0])
'''
numpy.int64
'''
# change 'date' to 'datetime'
df_sales_clean['date'] = pd.to_datetime(df_sales_clean['date'])

# check change
type(df_sales_clean['date'][0])
'''
pandas._libs.tslibs.timestamps.Timestamp
'''

7.3. Joining Datasets

To facilitate a more detailed analysis of this data, I will combine the two datasets, products and sales.

# rename the 'item' attribute to 'item_id' to match the 'products' dataset
df_sales_clean.rename(columns= {'item': 'item_id'}, inplace=True)
# join the 'sales' and 'products' datasets according to the 'item_id' attribute
df_merged = df_products_clean.merge(df_sales_clean, on='item_id', how='right')

# print dataset size
print(df_merged.shape)

# check the first entries of the dataset
df_merged.head()

We can confirm that the merge was successful because we have the same 720,071 records in the merged dataset as we did in the original sales dataset. Regarding the attributes, we had 9 in the products dataset and 4 in the sales dataset. Since both datasets contained the relationship variable item_id, it's only counted once, and that's why we expected to find 12 attributes, as shown above.
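As a quick sanity check, these expectations can also be asserted in code; a minimal sketch using the DataFrames created above:

# sanity check: the merge should keep the number of sales records
# and yield 9 + 4 - 1 = 12 attributes ('item_id' counted only once)
assert df_merged.shape[0] == df_sales_clean.shape[0]
assert df_merged.shape[1] == df_products_clean.shape[1] + df_sales_clean.shape[1] - 1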

Let’s also check the final records of the dataset and ensure that the data types of the variables have been preserved.

# check latest entries in 'df_merged'
df_merged.tail()
# check type of variables
df_merged.info()

8. Data Analysis

Now that the data has been processed, we can generate the report using Pandas Profiling again. This time, we have relevant attributes related to monetary values and sales, which, due to being in the wrong data type previously, couldn’t be analyzed. This report will provide us with valuable insights and analysis regarding these attributes and the dataset as a whole.

# create report
report = ProfileReport(df_merged)

# view report
report.to_notebook_iframe()

The report generated with the code above provides several key insights:

  • The dataset consists of 720,071 records and 12 attributes.
  • Upon inspecting the first and last entries of the dataset, it appears to be well-filled and free of anomalies.
  • There are no missing data.
  • No duplicate data entries were found.
  • There is one datetime column, five numeric columns, and six categorical columns.
  • As before, the cardinality of some attributes and high correlations between some of them were detected.
  • The vintage attribute now has 3.7% of its values as 0 due to the treatment applied to this variable.
  • In the date column, a few dates from 2016 and 2017 appear, outside the expected 2018–2020 period.

It’s worth highlighting some important statistics shown in the report. Other information regarding the vintage, country, region, and producers remains the same as before.

8.1. Remove dates from 2016 and 2017

Since the dataset is supposed to cover 3 years of data (2018–2020), these records are probably remnants of the synthetic data construction. I will therefore exclude the data from 2016 and 2017.

# remove data for 2016 and 2017
df_clean = df_merged.loc[(df_merged['date'].dt.year != 2016) & (df_merged['date'].dt.year != 2017)]
# check deletion, ordering data from oldest to newest
df_clean.sort_values(by='date')

df_clean.head()

8.2. Most expensive and cheapest wine

Let’s identify the most expensive and cheapest wine.

# identify the most expensive wine
df_clean[df_clean['price_usd'] == 1901.73].iloc[0]
# identify the cheapest wine
df_clean[df_clean['price_usd'] == 9.13].iloc[0]

9. Feature Engineering

The process of feature engineering involves creating new variables from existing data with the aim of improving the performance of a machine-learning model.

In this case, since we have the time variable and are conducting a Time Series analysis, it’s beneficial to separate it into as many individual attributes as possible. This means breaking it down into day of the month, day of the week, day of the year, week of the year, month, quarter, and year. These new features can provide valuable insights and patterns for the Time Series analysis.

# create attributes from date
df_clean['Year'] = df_clean.date.dt.year # year
df_clean['Quarter'] = df_clean.date.dt.quarter # quarter
df_clean['Month'] = df_clean.date.dt.month # month
df_clean['Week'] = df_clean.date.dt.week # week of the year (in pandas >= 2.0 use dt.isocalendar().week, since dt.week was removed)
df_clean['Weekday'] = df_clean.date.dt.weekday # day of the week
df_clean['Day'] = df_clean.date.dt.day # day of month
df_clean['Dayofyear']= df_clean.date.dt.dayofyear # day of the year

# check changes
df_clean.head()

Now that we have information about the day of the week (0 corresponds to Monday and so on, ending at 6, which refers to Sunday), I will construct a new attribute to distinguish between weekdays and weekends. This can be a valuable feature for modeling and analyzing time series data, as the behavior of sales or other variables may differ on weekdays compared to weekends.

# create attribute 'Weekend' with values equal to 0
df_clean['Weekend'] = 0

# convert values corresponding to the weekend to 1
## Saturday = 5
## Sunday = 6
df_clean.loc[(df_clean.Weekday == 5) | (df_clean.Weekday == 6), 'Weekend'] = 1

# check changes
df_clean.head()
# check final records
df_clean.tail()

Let’s visually check the average amount of sales during the week and on weekends.

# plot graph with average sales value during the week x weekend
df_clean.groupby('Weekend').sales.mean().plot.bar();

It’s noticeable that the average values are very close.

Another piece of information we can extract is the total revenue generated by each product per day. To do this, we can multiply the wine’s price attributes, given by price_usd and price_brl, by the number of units sold, which is given by sales. This will give us the daily revenue generated by each product, providing valuable insights into product performance.

# create new attributes 'total_amount_usd' and 'total_amount_brl' to report revenue
df_clean['total_amount_usd'] = df_clean['price_usd'] * df_clean['sales']
df_clean['total_amount_brl'] = df_clean['price_brl'] * df_clean['sales']

# check changes
df_clean.head()

It is worth checking the main statistics of these new attributes.

# print statistical report
df_clean.describe().round(2)

In the total_amount_usd and total_amount_brl columns, we can observe that the average spending is approximately 12,138 USD or 70,158 BRL. However, due to the high standard deviation, we have a lower median value of 5,051 USD or 29,198 BRL. This can also be seen in the difference between the 3rd quartile and the maximum value, indicating the presence of outliers. Nevertheless, I will retain these values in the dataset since they may occur and have an impact in the future as well. Thus, it's important to include them in the data for the upcoming time series analysis.
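To get a rough sense of how many records the classic 1.5 × IQR rule would flag as outliers, here is a minimal sketch (an illustrative check, not part of the original preprocessing):

# share of outliers in 'total_amount_usd' using the 1.5 * IQR rule
q1 = df_clean['total_amount_usd'].quantile(0.25)
q3 = df_clean['total_amount_usd'].quantile(0.75)
iqr = q3 - q1
outliers = df_clean[(df_clean['total_amount_usd'] < q1 - 1.5 * iqr) |
                    (df_clean['total_amount_usd'] > q3 + 1.5 * iqr)]
print('Outlier share: {:.2%}'.format(len(outliers) / len(df_clean)))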

10. Business Analysis

Up to this point, we’ve conducted exploratory data analysis with the aim of understanding the provided dataset and checking for anomalies that could impact both dataset comprehension and the construction of predictive models. We have performed the necessary data preprocessing and added relevant variables for time series analysis, such as breaking down the date attribute.

However, we haven’t yet gained a deeper understanding of the data, particularly regarding insights that are relevant for the company to better understand its demand and how to leverage this information to improve sales, offerings, inventory management, and ultimately increase profits.

In this section, I will specifically analyze the supply side to understand which regions sell the most in terms of volume and product value. This study is relevant because we can have a wine that sells 10 bottles daily and costs 10 USD each, which would correspond to 700 USD per week. On the other hand, we can have a bottle priced at 700 USD but only sell one unit per week. Therefore, in terms of revenue, they are equivalent. So, which one would be better to have in stock? Is it possible to sell more of the 10 USD wine or the 700 USD wine? These are the questions we will try to answer with a more detailed analysis.

10.1. Wines: best sellers x highest revenue

First, I will construct two separate sets:

  • The top 20 wines that generated the most revenue (and then collect the information about units sold).
  • The top 20 wines sold by units (and then collect the revenue information).

With this data, I can create a scatter plot that shows the relationship between the revenue generated and the number of units sold for the top 20 wines. These are two separate sets because, when ranked by revenue, a wine may not always be among the top 20 in terms of units sold, and vice versa. This analysis will help us understand the performance of different wines in terms of revenue and sales volume.

# top 20 wines that generated the most revenue
top_amount = df_clean.groupby('name')['total_amount_usd'].sum().sort_values(ascending=False)
top_amount_sales = df_clean.groupby('name')['sales'].sum().sort_values(ascending=False)
top_amount_merged = pd.merge(top_amount, top_amount_sales, on='name', how='left')
top_amount_merged['rank'] = range(1, (len(top_amount)+1))

top_amount_merged[:20]
# top 20 best-selling wines per unit
top_sales = df_clean.groupby('name')['sales'].sum().sort_values(ascending=False)
top_sales_amount= df_clean.groupby('name')['total_amount_usd'].sum().sort_values(ascending=False)
top_sales_merged= pd.merge(top_sales, top_sales_amount, on='name', how='left')
top_sales_merged['rank'] = range(1, (len(top_amount)+1))

top_sales_merged[:20]
# plot scatter plot
# the 20 best-selling wines per unit x highest revenue
# configure chart

top20_amount = top_amount_merged.iloc[0:20]
top20_sales = top_sales_merged.iloc[0:20]

fig, ax = plt.subplots(figsize=(10, 5))

ax.scatter(x=top20_amount['total_amount_usd'].values, y=top20_amount['sales'].values, c='red', s=70, alpha=0.4)
ax.scatter(x=top20_sales['total_amount_usd'].values, y=top20_sales['sales'].values, c='blue', s=20, alpha=0.4)
ax.set_title('Top 20 best-selling wines')
ax.set_xlabel('Revenue')
ax.set_ylabel('Units sold')
ax.legend(['Revenue', 'Units sold'], loc='upper center')

plt.show()

The points of the two groups were plotted in different colors and sizes to facilitate interpretation. From the graph above, we can observe two isolated points in the top right corner, which correspond to the top 1 in terms of units sold and the top 1 in terms of revenue. Therefore, it’s a wine that has a high number of units sold and represents the highest revenue for the company.

To confirm this information, we can refer back to the lists above and verify that it is indeed the wine ‘Domaine Ponsot Chapelle-Chambertin Grand Cru.’ This information can also be confirmed with the code provided.

# print top 1 wine with the most units sold and the top 1 that generated the most revenue
print('The top 1 wine with the most units sold: {}\n'
'The top 1 wine that generated the most revenue: {}'.format(top_sales_merged.index[0], top_amount_merged.index[0]))
'''
The top 1 wine with the most units sold: Domaine Ponsot Chapelle-Chambertin Grand Cru
The top 1 wine that generated the most revenue: Domaine Ponsot Chapelle-Chambertin Grand Cru
'''

I’ll check the unit price of this wine:

# identify the price of the best-selling wine that generates the most revenue
top1_price = df_products_clean.loc[df_products_clean.name == 'Domaine Ponsot Chapelle-Chambertin Grand Cru']
print('Domaine Ponsot Chapelle-Chambertin Grand Cru wine costs US$ {} per unit.'.format(top1_price['price_usd'].values[0]))
'''
Domaine Ponsot Chapelle-Chambertin Grand Cru wine costs US$ 656.06 per unit.
'''

As seen in the exploratory analysis of products, this wine is one of the most expensive wines that the company sells, as its price is already in the 95th percentile, indicating that it has a higher price than 95% of the other wines in our dataset.
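This percentile claim can be verified directly; a minimal sketch using the variables defined above:

# fraction of the catalog cheaper than the Domaine Ponsot Chapelle-Chambertin Grand Cru
top1_usd = top1_price['price_usd'].values[0]
pct = (df_products_clean['price_usd'] < top1_usd).mean()
print('This wine is more expensive than {:.0%} of the catalog.'.format(pct))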

Another noteworthy example in the scatter plot is the second-highest point on the graph, marked in blue and on the left side. This indicates that despite being the second-highest wine in terms of units sold, it generates significantly lower revenue compared to the others.

Let’s further investigate this.

# print top 2 wines with the most units sold
print('The top 2 wines with the most units sold: {}'.format(top_sales_merged.index[1]))
'''
The top 2 wines with the most units sold: Potensac
'''
# check its placement in the top 20 revenue
# search top2 wine ranking 'Potensac'
top2_in_amount = top_amount_merged.loc[top_amount_merged.index == 'Potensac']
print(top2_in_amount['rank'])
'''
name
Potensac 33
Name: rank, dtype: int64
'''
# identify price of top2 wine in units sold
top2_in_amount_price = df_products_clean.loc[df_products_clean.name == 'Potensac']
print('Potensac wine costs US$ {} per unit.'.format(top2_in_amount_price['price_usd'].values[0]))
'''
Potensac wine costs US$ 112.32 per unit.
'''

We can see that the wine ‘Potensac,’ which is the 2nd highest in terms of the number of bottles sold, is ranked 33rd in terms of revenue and has a price of 112 USD. According to our exploratory analysis of the products, this price falls between the median of 88 USD and the 3rd quartile of 164 USD, indicating that it is a moderately priced wine, leaning towards the expensive side. In other words, it is not one of the more budget-friendly options among those available from the company.

10.2. Producers: median price x quantity of products

Another interesting relationship to examine is the number of different products each producer offers. Depending on the situation, it might be relevant to strategize on purchasing larger quantities from certain producers to obtain bigger discounts and increase profits on specific products, for example.

Please note that the company currently does business with 58 different producers.

I will construct two separate sets:

  • The top 20 producers with the highest median price per bottle of wine.
  • The top 20 producers with the highest quantity of products sold by the company, per producer.

I will use the median as it is a statistic less sensitive to outliers. Based on this information, I will create a scatter plot that shows the relationship between the highest median prices by producer and the quantity of products sold by producer. This analysis can provide insights into the pricing strategies and product diversity of different producers.

# producers: quantity of product x median price
producer_median_price = df_clean.groupby('producer')['price_usd'].median().sort_values(ascending=False)
producer_items = df_clean.groupby('producer')['item_id'].nunique().sort_values(ascending=False)

# gather information
producer_mp = pd.merge(producer_median_price, producer_items, on='producer', how='left')
producer_it = pd.merge(producer_items, producer_median_price, on='producer', how='left')
# print list with top 20 producers by median price
producer_mp_top20 = producer_mp.iloc[:20]
print(producer_mp_top20)
# print list with top 20 producers by quantity of products
producer_it_top20 = producer_it.iloc[:20]
print(producer_it_top20)
# plot scatter plot
# configure chart
fig, ax = plt.subplots(figsize=(10, 5))

ax.scatter(x=producer_mp_top20['price_usd'].values, y=producer_mp_top20['item_id'].values, c='red', s=70, alpha=0.4)
ax.scatter(x=producer_it_top20['price_usd'].values, y=producer_it_top20['item_id'].values, c='blue', s=20, alpha=0.4)
ax.set_title('Top 20 producers')
ax.set_xlabel('Price median')
ax.set_ylabel('Quantity of products')
ax.legend(['by Median Price', 'Quantity of products'], loc='upper center')

plt.show()

We can observe two overlapping points at the far-right and top of the graph. This suggests that we likely have a producer whose median price is the highest and who also has the highest quantity of products. Below, we will identify these two points to confirm this hypothesis.

# identify the top 1 producer by median price and the top 1 by quantity of products
print('The top 1 producer with the highest median price: {}\n'
'The top 1 producer with the largest quantity of products: {}'.format(producer_median_price.index[0], producer_items.index[0]))
'''
The top 1 producer with the highest median price: Domaine Ponsot
The top 1 producer with the largest quantity of products: Domaine Ponsot
'''

We have confirmed that the producer with the highest median price is also the one with the highest quantity of products. Furthermore, as seen earlier, this producer has the wine with the highest revenue generated for the company and is also the best-selling wine (per bottle).

With this information, several strategies can be considered for the company, such as:

  • Identifying other producers in the same region as Domaine Ponsot.
  • Searching for wines similar to Domaine Ponsot Chapelle-Chambertin Grand Cru.
  • Offering other products from Domaine Ponsot that are not currently available from the company.
  • Creating experiences related to this producer, such as tasting kits.

These strategies can help the company leverage its strong relationship with this high-revenue and high-sales producer to increase overall profitability and enhance the customer experience.

10.3. Producers: best sellers x highest revenue

Just as we checked the best-selling and highest revenue-generating wines, we can do the same analysis for producers. This is interesting because, with additional information such as customer consumption preferences, it’s possible to create strategies that will consequently increase sales.

For this analysis, I will construct two sets:

  • The top 20 producers that generated the most revenue.
  • The top 20 producers with the highest quantity of items sold.

Then, I will create a scatter plot that shows the relationship between the revenue generated and the producers with the highest quantity of units sold. These are two separate sets because when ranked by revenue, a producer may not always be among the top 20 in terms of units sold, and vice versa. This analysis can provide insights into the producers that contribute the most to revenue and sales volume.

# top 20 producers that generated the most revenue
top_p_amount= df_clean.groupby('producer')['total_amount_usd'].sum().sort_values(ascending=False)
top_p_sales = df_clean.groupby('producer')['sales'].sum().sort_values(ascending=False)
top_p_amount_merged= pd.merge(top_p_amount, top_p_sales, on='producer', how='left')
top_p_amount_merged['rank'] = range(1, (len(top_p_amount)+1))

top_p_amount_merged[:20]
# top 20 best-selling producers
top_p_sales = df_clean.groupby('producer')['sales'].sum().sort_values(ascending=False)
top_p_amount= df_clean.groupby('producer')['total_amount_usd'].sum().sort_values(ascending=False)
top_p_sales_merged = pd.merge(top_p_sales, top_p_amount, on='producer', how='left')
top_p_sales_merged['rank'] = range(1, (len(top_p_amount)+1))

top_p_sales_merged[:20]
# plot scatter plot
# the 20 producers with the highest sales x highest revenue
# configure chart

top20_p_amount = top_p_amount_merged.iloc[0:20]
top20_p_sales = top_p_sales_merged.iloc[0:20]

fig, ax = plt.subplots(figsize=(10, 5))

ax.scatter(x=top20_p_amount['total_amount_usd'].values, y=top20_p_amount['sales'].values, c='red', s=70, alpha=0.4)
ax.scatter(x=top20_p_sales['total_amount_usd'].values, y=top20_p_sales['sales'].values, c='blue', s=20, alpha=0.4)
ax.set_title('Top 20 best-selling producers')
ax.set_xlabel('Revenue')
ax.set_ylabel('Units sold')
ax.legend(['Revenue', 'Units sold'], loc='upper center')

plt.show()

Once again, we can observe two points that differ from the others, at the right end and at the top of the graph. Let’s proceed with confirmation.

# print top 1 producer with the most units sold and the top 1 that generated the most revenue
print('The top 1 producer with the most units sold: {}\n'
'The top 1 producer that generated the most revenue: {}'.format(top_p_sales_merged.index[0], top_p_amount_merged.index[0]))
'''
The top 1 producer with the most units sold: Domaine Ponsot
The top 1 producer that generated the most revenue: Domaine Ponsot
'''

And there it is again: Domaine Ponsot. If we look at the ranking lists above once more, indeed, the gap between Domaine Ponsot and the second-place producer is quite significant.

Additionally, it’s worth noting that Chateau Latour, which ranks second in terms of producers generating the most revenue during the period, does not appear in the list of the top 20 producers with the highest quantity of wines sold. If we revisit the analysis from the previous section, Chateau Latour ranks 5th among producers with the highest median price per bottle (500 USD), but it also does not feature among the top 20 producers with the most products offered. Therefore, we can infer that despite offering a limited number of products and having few units sold, its second position in revenue may be due to the high value of its wines.

10.4. Region: quantity of product x revenue

We can also compare the quantity of products offered and the revenue generated by the region of wine production.

Once again, I will generate two sets of data:

  • The top 20 regions that generated the most revenue.
  • The top 20 regions with the highest quantity of items offered.

Finally, I will build a scatter plot that shows the relationship between them.

# top 20 regions that generated the most revenue
top_r_amount= df_clean.groupby('region')['total_amount_usd'].sum().sort_values(ascending=False)
top_r_sales = df_clean.groupby('region')['item_id'].nunique().sort_values(ascending=False)
top_r_amount_merged= pd.merge(top_r_amount, top_r_sales, on='region', how='left')
top_r_amount_merged['rank'] = range(1, (len(top_r_amount)+1))

top_r_amount_merged[:20]
# top 20 regions with the highest quantity of products offered
top_r_sales = df_clean.groupby('region')['item_id'].nunique().sort_values(ascending=False)
top_r_amount= df_clean.groupby('region')['total_amount_usd'].sum().sort_values(ascending=False)
top_r_sales_merged = pd.merge(top_r_sales, top_r_amount, on='region', how='left')
top_r_sales_merged['rank'] = range(1, (len(top_r_amount)+1))

top_r_sales_merged[:20]
# plot scatter plot
# the 20 regions with the highest sales x highest revenue
# configure chart
fig, ax = plt.subplots(figsize=(10, 5))

ax.scatter(x=top_r_amount_merged['total_amount_usd'].values, y=top_r_amount_merged['item_id'].values, c='red', s=70, alpha=0.4)
ax.scatter(x=top_r_sales_merged['total_amount_usd'].values, y=top_r_sales_merged['item_id'].values, c='blue', s=20, alpha=0.4)
ax.set_title('Top 20 regions: revenue x quantity of products')
ax.set_xlabel('Revenue')
ax.set_ylabel('Quantity of products')
ax.legend(['by Revenue', 'by Quantity of products'], loc='upper center')

plt.show()

Now, we can see four points that stand out, with two pairs of points overlapping. By checking the rankings above, we can confirm that the top two regions generating the most revenue are also the top two regions with the highest quantity of different products available. It’s worth noting that the correlation between items and region had already been identified during the data analysis.

Another interesting point is that Ribera del Duero, with only 5 products, ranks 3rd in terms of revenue generated. This indicates that despite having a limited number of products, this region is still performing well in terms of revenue.

11. Demand Forecast

For this section, the following steps are required:

  • Data preparation to meet the requirements of the libraries.
  • Checking the time series (stationary or non-stationary).
    - If non-stationary, apply the necessary transformations.
  • Splitting the dataset into training and validation sets.
  • Making predictions using different algorithms.

11.1. Data Preparation

Most libraries, such as Prophet and PyCaret, expect the dataset to contain only the date information and the values, which in our case are the sales. Additionally, Prophet requires the attributes to be named in a specific format (ds for the date and y for the value), so I’ll format them accordingly.

# create variable 'df_ts' to receive formatted data
# insert sales data, per day (so sum)
df_ts = df_clean.groupby('date',as_index=False)['sales'].sum()

# verify
df_ts.head()
# rename attributes
df_ts.columns = ['ds', 'y']

# set the date as the index
df_ts.index = pd.to_datetime(df_ts['ds'], format="%Y-%m-%d")

# remove the 'ds' column, which is now duplicated in the index
df_ts.drop('ds', axis=1, inplace=True)

# check changes
df_ts.head()

11.2. Stationary Series

In order to use our data for making forecasts, it’s a requirement for some algorithms that we work with a stationary time series. This means that the dataset should have statistical properties that remain constant over time. In other words, the mean, variance, and covariance should remain constant over the time interval (Image 1). This is because most statistical methods assume the premise of working with a stationary time series for calculations.

Image 1 — Example of a stationary series: mean, variance, and covariance remain constant over time

Time Series are data distributed at sequential and regular intervals over time.

Due to the time dependency of the variable, traditional techniques are not suitable for this type of problem. Failing to consider the time dependency can alter the meaning of the data and lead to incorrect conclusions.
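To build intuition for what “constant statistical properties” means, here is a small illustrative sketch using synthetic data (not our dataset), comparing a stationary series with a non-stationary one:

# illustrative example: white noise (stationary) vs. random walk (non-stationary)
rng = np.random.default_rng(42)
noise = pd.Series(rng.normal(0, 1, 1000))  # stationary: constant mean and variance
random_walk = noise.cumsum()               # non-stationary: the cumulative sum drifts over time

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 3))
noise.plot(ax=ax0, title='Stationary (white noise)')
random_walk.plot(ax=ax1, title='Non-stationary (random walk)')
plt.tight_layout();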

11.2.1. Augmented Dickey-Fuller (ADF) Test

To check if a particular dataset is a stationary time series, there are various methods that can be applied. However, statistical tests are the most reliable. Among them, the most well-known is the Augmented Dickey-Fuller (ADF) test, which we will consider for hypothesis testing here:

  • Null Hypothesis (H0) ▶ p-value > 0.05
    We fail to reject the hypothesis that the series has a unit root; that is, the series is not stationary and the data retains some time dependency.
  • Alternative Hypothesis (H1) ▶ p-value <= 0.05
    We reject the null hypothesis, meaning the series is stationary.

Note

The significance level is a threshold chosen for each case to decide whether to reject the null hypothesis, while the p-value is computed from the data: the smaller the p-value, the stronger the evidence against the null hypothesis. For example, with a p-value of 0.04 we could reject the null hypothesis at the 5% significance level and treat the series as stationary.

Therefore, if the value is above the threshold, we cannot reject the null hypothesis, but this is not an absolute truth. It simply means that we did not find enough evidence to reject it.

# extract only the values
X = df_ts.y

# apply ADF
result = adfuller(X)

# print results
print('Augmented Dickey-Fuller')
print('Statistical Test: {:.4f}'.format(result[0]))
print('P-Value: {:.10f}'.format(result[1]))
print('Critical Values: ')
for key, value in result[4].items():
    print('\t{}: {:.4f}'.format(key, value))
'''
Augmented Dickey-Fuller
Statistical Test: -2.3601
P-Value: 0.1533303754
Critical Values:
1%: -3.4365
5%: -2.8642
10%: -2.5682
'''

The p-value calculated for this set is 0.1533, above the 0.05 threshold that was set. Therefore, we need to transform this series into a stationary one.

11.2.2. Transform Series

The process of transforming the series into stationary data requires removing the trend and seasonality in the data. There are various ways to do this. In this case, I’ll start by applying a logarithm to reduce the magnitude of the series values, and then I’ll subtract the moving average of a certain period from the log of the series.

I’ll check the series again by applying the ADF test, and then I’ll also apply the differencing method. After that, I’ll perform another ADF test to confirm stationarity.

To begin, let’s plot the raw data to visually assess its original state. Since we’re dealing with daily data, I’ll use a 30-day moving average, which corresponds to one month.

# plot original time series
## set 30 period moving average
ma = df_ts.rolling(30).mean()

## configure chart
fig, ax = plt.subplots(figsize=(10,5))
df_ts.plot(ax=ax, c='#004a8f', legend=False)
ma.plot(ax=ax, c='#E3F700', legend=False)
plt.tight_layout();

Now, I will apply the logarithm to reduce the magnitude and asymmetry of our data. Next, I will subtract the 30-day moving average from the log of the time series.

# apply log
df_log = np.log(df_ts)

# calculate log of moving average of 30
ma_log = df_log.rolling(30).mean()

# plot graph
fig, ax = plt.subplots(figsize=(10,5))
df_log.plot(ax=ax, c='#004a8f', legend=False)
ma_log.plot(ax=ax, c='#E3F700', legend=False)
plt.tight_layout();

Note that in the previous graph, without applying the log, the y-axis went up to 52 thousand. Now, we can see above that the log has reduced the y-axis considerably: to 10.85.

# subtract the 30-day moving average from the log of the data
df_sub = (df_log - ma_log).dropna()
# note: dropna is necessary because the first 30 values will be blank
# and this blank data can harm the transformation
ma_sub = df_sub.rolling(30).mean()
# standard deviation
std_sub = df_sub.rolling(30).std()

# plot graph
fig, ax = plt.subplots(figsize=(10,5))
df_sub.plot(ax=ax, c='#004a8f', legend=False)
ma_sub.plot(ax=ax, c='#E3F700', legend=False)
std_sub.plot(ax=ax, c='#F57169', legend=False)
plt.tight_layout();

We will repeat the ADF test to verify that we now have a stationary series.

# extract only the values
X_sub = df_sub.y

# apply ADF
result_sub = adfuller(X_sub)

# print results
print('Augmented Dickey-Fuller')
print('Statistical Test: {:.4f}'.format(result_sub[0]))
print('P-Value: {:.10f}'.format(result_sub[1]))
print('Critical Values: ')
for key, value in result_sub[4].items():
    print('\t{}: {:.4f}'.format(key, value))
'''
Augmented Dickey-Fuller
Statistical Test: -4.8494
P-Value: 0.0000437098
Critical Values:
1%: -3.4366
5%: -2.8643
10%: -2.5682
'''

With a p-value result of 0.0000437, we can now reject the null hypothesis and consider the series as stationary. Remember that the previous p-value obtained was 0.1533, which was above the threshold of 0.05.

Differencing is another technique that helps turn a time series into stationary data. In differencing, we compute the difference between consecutive observations, so the calculation is given by:

y'(t) = y(t) - y(t-1)

In other words, given one observation, we subtract the immediately preceding observation from it.

# apply differencing

# diff = 1
# because we take the difference over 1 period (consecutive observations)
df_diff = df_sub.diff(1)
ma_diff = df_diff.rolling(30).mean()
std_diff = df_diff.rolling(30).std()

fig, ax = plt.subplots(figsize=(10,5))
df_diff.plot(ax=ax, c='#004a8f', legend=False)
ma_diff.plot(ax=ax, c='#E3F700', legend=False)
std_diff.plot(ax=ax, c='#F57169', legend=False)
plt.tight_layout();
# extract only the values
X_diff = df_diff.y.dropna().values

# apply ADF
result_diff = adfuller(X_diff)

# print results
print('Augmented Dickey-Fuller')
print('Statistical Test: {:.4f}'.format(result_diff[0]))
print('P-Value: {:.20f}'.format(result_diff[1]))
print('Critical Values: ')
for key, value in result_diff[4].items():
    print('\t{}: {:.4f}'.format(key, value))
'''
Augmented Dickey-Fuller
Statistical Test: -10.9003
P-Value: 0.00000000000000000012
Critical Values:
1%: -3.4366
5%: -2.8643
10%: -2.5682
'''

Please note how we were able to reduce the p-value to the point where we had to increase the number of decimal places from 10 to 20 to get a sense of its value.

Below, I’ll plot the graphs constructed above to visually check the transformation of the data into a stationary series at each stage. Additionally, I’ll display the p-value obtained at each stage using the ADF test.

# configure chart
fig, (ax0, ax1, ax2) = plt.subplots(nrows=1, ncols=3, figsize=(20,5))

# original data
ax0.set_title('Original Data\n ADF {:.4}'.format(result[1]))
ax0.plot(df_ts, c='#004a8f')
ax0.plot(ma, c='#E3F700')

# log minus moving average
ax1.set_title('Log - Moving Average\n ADF {:.4}'.format(result_sub[1]))
ax1.plot(df_sub, c='#004a8f')
ax1.plot(ma_sub, c='#E3F700')
ax1.plot(std_sub, c='#F57169')

# application of differentiation
ax2.set_title('Differentiation\n ADF {:.4}'.format(result_diff[1]))
ax2.plot(df_diff, c='#004a8f')
ax2.plot(ma_diff, c='#E3F700')
ax2.plot(std_diff, c='#F57169')

plt.tight_layout();

11.3. Split the set into training and validation data

We aim to create a model that generalizes to real-world data as well as possible (without underfitting or overfitting). That’s why we divide our dataset into two subsets: one for training and the other for validation. The training data will be used for the model to learn, and the validation data will be used to assess the model’s performance. Note that, unlike many other machine learning problems, a time series split must respect chronological order: the most recent observations are held out for validation rather than being sampled at random.

There is no fixed rule for the size of the split, but we need to be aware that the longer the horizon we are trying to predict, the greater the chance of the model making mistakes. Additionally, the horizon to be predicted depends on how these forecasts will be used within the company. That’s why I considered a 120-day forecast, which is useful for a roughly quarterly projection for the company.

First, let’s check our time series after the transformations.

# check time series
df_diff.info()
'''
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1067 entries, 2018-01-30 to 2020-12-31
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 y 1066 non-null float64
dtypes: float64(1)
memory usage: 16.7 KB
'''

Note that we have 1067 entries and only 1066 of them are filled; therefore, we have a null value. I will remove it so it does not interfere with the predictions.

# remove null data
df_diff.dropna(inplace=True)

# check time series
df_diff.info()
'''
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1066 entries, 2018-01-31 to 2020-12-31
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 y 1066 non-null float64
dtypes: float64(1)
memory usage: 16.7 KB
'''

If we have 1066 data points, then our training set will contain records 0 to 945, while the validation set will contain the last 120 days, that is, records 946 to 1065.

# split set into training='train' and validation='valid'
train = df_diff[:946]
valid = df_diff[946:]
# checking training data
print('Training Data\n'
'\t start: {}\n'
'\t end: {}'.format((train.index[0]), (train.index[-1])))

# verification of validation data
print('\nValidation Data\n'
'\t start: {}\n'
'\t end: {}'.format((valid.index[0]), (valid.index[-1])))
'''
Training Data
start: 2018-01-31 00:00:00
end: 2020-09-02 00:00:00

Validation Data
start: 2020-09-03 00:00:00
end: 2020-12-31 00:00:00
'''

11.4. Make predictions

For comparison purposes, different forecasts will be made using the techniques listed below so that we can assess their performance.

  • Naive approach¹
  • Moving Average¹
  • Holt’s Linear Trend Model¹
  • ARIMA²
  • Prophet²
  • PyCaret¹

[¹] These techniques will use the clean and processed data, i.e., without the series being transformed into stationary data: df_ts, train_ts, and valid_ts.

[²] These techniques will use the treated data, i.e., with the series converted into stationary data: df_diff, train, and valid.

Let’s review the formatting of the clean and processed dataset, called df_ts, and prepare it, i.e., split it into training and validation data.

# print first 5 entries of set 'df_ts'
df_ts.head()
# check time series
df_ts.info()
'''
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1096 entries, 2018-01-01 to 2020-12-31
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 y 1096 non-null int64
dtypes: int64(1)
memory usage: 17.1 KB
'''
# split set into training='train_ts' and validation='valid_ts'
train_ts = df_ts[:976]
valid_ts = df_ts[976:]

# copy validation data
y_hat_ts = valid_ts.copy()
# checking training data
print('Training Data\n'
'\t start: {}\n'
'\t end: {}'.format((train_ts.index[0]), (train_ts.index[-1])))

# verification of validation data
print('\nValidation Data\n'
'\t start: {}\n'
'\t end: {}'.format((valid_ts.index[0]), (valid_ts.index[-1])))
'''
Training Data
start: 2018-01-01 00:00:00
end: 2020-09-02 00:00:00

Validation Data
start: 2020-09-03 00:00:00
end: 2020-12-31 00:00:00
'''

11.4.1. Naive Approach

This method makes predictions based on the last available data point. It serves as an initial baseline, a reference value for future improvements. It’s a straightforward inference to address the problem, and we know that we can improve upon it.

# copy the last value of the 'train_ts' set and assign it to 'y_hat_ts' in a new 'naive' attribute
y_hat_ts['naive'] = train_ts.iloc[-1].values[0]

# check changes
y_hat_ts.head()
# plot graph
fig, ax = plt.subplots(figsize=(8,4))
train_ts.plot(ax=ax, c='#004a8f')
valid_ts.plot(ax=ax, c='#91aac9')

# create forecast line with y_hat['naive']
y_hat_ts['naive'].plot(ax=ax, c='#F422D1', linewidth=1.8)

# legends
ax.legend(['Train', 'Valid', 'Naive'])

plt.show()

11.4.2. Moving average

In the Moving Average approach, we use a customized time window for prediction. This way, moving averages smooth out curves and reduce noise (dispersion).

Let’s make a forecast for our time series based on a 3-day moving average. Instead of relying on a single data point as in the Naive approach, we’ll aggregate the last three data points.

# calculate the moving average of the last 3 values available in the set
# save to a new attribute called 'mm3'
y_hat_ts['mm3'] = train_ts.y.rolling(3).mean().iloc[-1]

# check changes
y_hat_ts.head()
# plot graph
fig, ax = plt.subplots(figsize=(8,4))
train_ts.plot(ax=ax, c='#004a8f')
valid_ts.plot(ax=ax, c='#91aac9')

# create forecast line with y_hat['mm3']
y_hat_ts['mm3'].plot(ax=ax, c='#F422D1', linewidth=1.8)

# legends
ax.legend(['Train', 'Valid', 'Moving Average (3)'])

plt.show()

11.4.3. Holt’s Linear Trend Model

Note that both the Naive approach and the Moving Average produce straight lines; this happens because the sense of trend is lost. Holt’s forecasting method brings this parameter back: it works not only with the level, as the previous techniques do, but also with the trend of the series, which leads to better results than the previous ones.

# save time series components to a variable
result = seasonal_decompose(train_ts)

# plot components
fig, (ax0, ax1, ax2, ax3) = plt.subplots(4, 1, figsize=(8,10))
result.observed.plot(ax=ax0, c='#91aac9')
result.trend.plot(ax=ax1, c='#004a8f')
result.seasonal.plot(ax=ax2, c='#91aac9')
result.resid.plot(ax=ax3, c='#91aac9')

plt.tight_layout()

We can see above, in darker blue, the graph that shows the trend of this time series.

# generate the Holt forecast for this series (training data = 'train_ts')
# and save it in a new 'holt' column
# smoothing_level = level (height) adjustment
# smoothing_slope = trend (slope) adjustment
# forecast as many points as there are in the validation set
y_hat_ts['holt'] = Holt(train_ts.y).fit(smoothing_level=0.005,
                                        smoothing_slope=0.01).forecast(len(valid_ts))

# check changes
y_hat_ts.head()
# plot graph
fig, ax = plt.subplots(figsize=(8,4))
train_ts.plot(ax=ax, c='#004a8f')
valid_ts.plot(ax=ax, c='#91aac9')

# create forecast line with y_hat_ts['holt']
y_hat_ts['holt'].plot(ax=ax, c='#F422D1', linewidth=1.8)

# legends
ax.legend(['Train', 'Valid', 'Holt'])

plt.show()

11.4.4. ARIMA

This is one of the most widely used models for time series forecasting, due to its ability to capture different sets of temporal structures within the series.

Its acronym, ARIMA, reflects the three components, each with its own order parameter, needed to build the model (a short sketch follows the list):

  • AutoRegressive (p)
  • Integrated, i.e., the degree of differencing (d)
  • Moving Average (q)
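
To make the role of (p, d, q) concrete, here is a minimal, hypothetical sketch of fitting an ARIMA with an explicit order using statsmodels. It is not part of this project's pipeline, where auto_arima (used further below) searches for the best order automatically.

# illustrative only: fit an ARIMA(p=1, d=1, q=1) on a toy series
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

idx = pd.date_range('2020-01-01', periods=100, freq='D')                         # hypothetical daily index
toy = pd.Series(np.random.default_rng(42).normal(size=100).cumsum(), index=idx)  # toy data

toy_model = ARIMA(toy, order=(1, 1, 1)).fit()  # p=1 AR term, d=1 difference, q=1 MA term
print(toy_model.forecast(steps=10))            # forecast the next 10 days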

It’s worth noting that for ARIMA, the series needs to be stationary. To achieve this, we will use the training data train and the validation data valid derived from the splitting of the df_diff dataset.

# instantiate the model (auto_arima searches for the best (p, d, q) order)
model = auto_arima(train,  # training dataset
                   suppress_warnings=True)

# train the model
model.fit(train)

# make predictions
forecast = model.predict(n_periods=len(valid))
forecast = pd.DataFrame(forecast, index=valid.index, columns=['Prediction'])
# plot graph
fig, ax = plt.subplots(figsize=(8,4))
train.plot(ax=ax, c='#004a8f')
valid.plot(ax=ax, c='#91aac9')

# create forecast line with 'forecast'
forecast.plot(ax=ax, c='#F422D1', linewidth=1.8)

# legends
ax.legend(['Train', 'Valid', 'ARIMA'])

plt.show()

I will plot a graph with just the data that was predicted by the model versus the actual data.

# plot graph
# concatenate prediction and validation data
ax = pd.concat([valid, forecast], axis=1).plot()

# set line colors
ax.lines[0].set_color('#91aac9') # 'valid'
ax.lines[1].set_color('#004a8f') # 'forecast'

# legends
ax.legend(['Valid', 'ARIMA'])

plt.show();

11.4.5. Prophet

The Prophet library, developed by Meta (formerly Facebook), is simple and easy to use, yet it remains a very robust tool.

As with ARIMA, Prophet will be applied here to the stationary series: the training data train and the validation data valid (resulting from the split of the df_diff dataset). Prophet expects a DataFrame with a ds column (dates) and a y column (values), which is why resetting the index below is enough to put the data into the required format.

# transform data into Prophet format
train_p = train.copy()
train_p.reset_index(inplace=True)
valid_p = valid.copy()
valid_p.reset_index(inplace=True)
# instantiate model
b = Prophet()

# train the model
b.fit(train_p)

# create the 'future' dataframe (historical dates extended by the validation horizon)
future = b.make_future_dataframe(periods=len(valid_p))

# make predictions with the 'future' data frame (created in the previous item)
forecast_p = b.predict(future)

# print
forecast_p.head()

As we can see above, Prophet outputs several pieces of information. The point forecast is in the yhat column, while yhat_lower and yhat_upper contain the lower and upper bounds of the uncertainty interval, respectively.
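
For instance, the columns that matter for the later comparison can be isolated like this (a small illustrative snippet, not part of the original pipeline):

# illustrative: keep only the date, the point forecast, and its bounds
forecast_p[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()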

Below, I will generate the visualization of the forecasts.

# plot predictions
b.plot(forecast_p).savefig('forecast.png')

With Prophet, we can also easily plot the components of our forecast, separated into 3 graphs:

  • Overall trend
  • Weekly seasonality
  • Yearly seasonality

# plot forecast components
# general, weekly, annual
b.plot_components(forecast_p).savefig('components.png')

Just like we saw in ARIMA, I’m going to plot the validation data and the predicted data here so we can compare them.

# plot graph
fig, ax = plt.subplots()
ax.plot(valid_p.ds, valid_p.y, c='#91aac9')
ax.plot(forecast_p.ds[946:], forecast_p.yhat[946:], c='#004a8f')

# legends
ax.legend(['Valid', 'Prophet'], loc='lower left')

# set x-axis font size
ax.tick_params(axis='x', labelsize=6)

plt.tight_layout();

Although this visualization is already informative, it is worth plotting another one in which we can also check the bounds predicted by the model, given by yhat_lower and yhat_upper.

# define a function to compare actual data with the forecast
def make_comparison_dataframe(historical, forecast):
    return forecast.set_index('ds')[['yhat', 'yhat_lower', 'yhat_upper']].join(historical)

# compare actual data with the forecast
cmp_df = make_comparison_dataframe(df_diff, forecast_p)
# view results
fig, ax = plt.subplots(figsize=(10,5))

# data to plot: bounds, actual values, and point forecast
ax.plot(cmp_df['yhat_lower'], c='#A826FF', ls='--', linewidth=0.5, label='yhat_lower')
ax.plot(cmp_df['yhat_upper'], c='#F422D1', ls='--', linewidth=0.5, label='yhat_upper')
ax.plot(cmp_df['y'], c='#bfbfbf', label='y')
ax.plot(cmp_df['yhat'], c='#004a8f', linewidth=0.8, label='yhat')

# legends
ax.legend()

plt.tight_layout()

In the graph above, the model's forecast (yhat) runs through the middle of the data, with the lower bound fitted more tightly than the upper one. For more optimistic forecasts, yhat_upper can be used as a reference; for more pessimistic ones, yhat_lower.

11.4.6. PyCaret

PyCaret is a user-friendly library that automates common steps of machine-learning model development, which makes it an extremely useful tool for building models quickly.

It’s important to note that auto-machine learning tools like this do not replace the work of a data scientist. Their use should be aimed at supporting and expediting the model-building process.

I am going to use the original, clean, and processed data here, meaning that the time series has not been transformed into a stationary one. This way, we avoid an additional step that some algorithms can already handle automatically.

The process involves 7 steps:

  1. Configuration
  2. Model creation and comparison
  3. Creation of the best model found in the previous step
  4. Hyperparameter tuning, according to the MAPE evaluation metric
  5. Model visualization
  6. Prediction on the test set
  7. Model finalization

From there, we can make the final prediction and generate a graph to visualize the predicted value by the model and the actual data.

The first step consists of providing the necessary parameters for model configuration.

# 1. configure the time series
s = setup(train_ts,         # dataset
          fh = 120,         # forecast horizon (number of points to be predicted)
          session_id = 123) # seed to generate reproducible results

Now, all models available in the PyCaret library will be created and trained, and then evaluated. The results are ranked and the best model for each metric is highlighted.

# 2. create and compare models
best = compare_models()

Above, 27 different models were created and tested. Among them, the ones that showed the best results were Huber w/ Cond. Deseasonalize & Detrending and K Neighbors w/ Cond. Deseasonalize & Detrending.

We can confirm the best algorithm created with the following code:

# 3. create model with the best detected algorithm
# identify the best model
print(best)
'''
BaseCdsDtForecaster(fe_target_rr=[WindowSummarizer(lag_feature={'lag': [7, 6, 5, 4, 3, 2, 1]},
                                                   n_jobs=1)],
                    regressor=HuberRegressor(), sp=7, window_length=7)
'''

Therefore, let’s continue using Huber w/ Cond. Deseasonalize & Detrending.

# instantiate and create the model with the best detected algorithm
huber = create_model('huber_cds_dt')

Note that the metric averages shown above are identical to the ones reported for this model in the comparison list. This is because the model is created and evaluated in exactly the same way as before.

# 4. adjust hyperparameters according to the MAPE value
tuned_huber = tune_model(huber, optimize='MAPE')

The average MAPE obtained here was slightly higher than that of the untuned model; however, in 2 of the 3 evaluation folds the tuned model performed better. Therefore, I will proceed with the tuned model.

Let’s visualize the predicted data versus actual data in the graph below.

# 5. view model
# plot graph with predictions
plot_model(tuned_huber, plot='forecast')

Before finalizing the model, we run a check by making predictions on the hold-out test set that the setup function separated in the initial configuration step. This way, we can spot any discrepancies in the results.

# 6. make predictions on the test set
holdout_pred = predict_model(tuned_huber)

The model performed slightly better than the result obtained in step 4, with the MAPE going from 0.0323 to 0.0318, well within the standard deviation of 0.0074.

In the final step of the model creation process, we conclude with the finalize_model function, which retrains the model on the complete dataset passed to the setup function, i.e., including the portion that had been held out for testing.

# 7. finalize model
final_huber = finalize_model(tuned_huber)

We can print the final model to check the parameters used.

# check parameters
print(final_huber)
'''
ForecastingPipeline(steps=[('forecaster',
    TransformedTargetForecaster(steps=[('model',
        BaseCdsDtForecaster(fe_target_rr=[WindowSummarizer(lag_feature={'lag': [7, 6, 5, 4, 3, 2, 1]},
                                                           n_jobs=1)],
                            regressor=HuberRegressor(), sp=7, window_length=7))]))])
'''

Now we can proceed with the prediction.

# make prediction
forecast_pycaret = predict_model(final_huber, fh=120)
# check the latest set entries to confirm dates
forecast_pycaret.tail()
# create attribute 'ds' with date values
forecast_pycaret['ds'] = forecast_pycaret.index.to_timestamp()

# rename attributes
forecast_pycaret.columns = ['y', 'ds']

# set 'ds' as index
forecast_pycaret.index = forecast_pycaret.ds

# delete attribute 'ds' that was duplicated
forecast_pycaret.drop('ds', axis=1, inplace=True)

# check changes
forecast_pycaret.head()

With the forecast data, I will plot it on a graph next to the actual data so we can compare it.

# plot graph
# concatenate prediction and validation data
ax = pd.concat([valid_ts, forecast_pycaret], axis=1).plot()

# set line colors
ax.lines[0].set_color('#91aac9') # 'valid'
ax.lines[1].set_color('#004a8f') # 'forecast'

# legends
ax.legend(['Valid', 'PyCaret'])

plt.show();

12. Model Assessment

The final step in model construction is performance evaluation, which measures how accurately each model's predictions match the actual data.

To do this, I will present two subsections: the first with graphs for a quick analysis of the models, and the second with two evaluation metrics that measure the errors of the models, providing a statistical and therefore more precise value to determine which model performed the best.

12.1. Reverse transformations for stationary series

Before proceeding with the evaluation: the ARIMA and Prophet models used data transformed with a logarithm, subtraction of a moving average, and differencing to make the series stationary. To compare their error values with those of the other models, we must reverse this treatment; otherwise, we would be comparing values on different scales.

First, I will prepare the Prophet data by separating only the predicted data and adjusting the index to the dates in the prediction and validation sets.

# create variable with Prophet forecast data
forecast_prophet = forecast_p[['ds', 'yhat']][946:]

# configure dates as index
forecast_prophet.set_index('ds', inplace=True)
valid_p.set_index('ds', inplace=True)

We then reverse the differencing (with a cumulative sum), add the moving average back, and undo the log transformation (with the exponential). To obtain the required outputs (columns named ds and y), it was also necessary to rename some attributes and remove null values to avoid errors.

# reverse the differencing (cumulative sum)
valid_revert = valid.cumsum()
valid_p_revert = valid_p.cumsum()
forecast_revert = forecast.cumsum()
forecast_p_revert = forecast_prophet.cumsum()

# rename attributes
forecast_revert.columns = ['y']
forecast_p_revert.columns = ['y']

# add the moving average back (reverse the subtraction)
valid_revert = valid_revert + ma_log
valid_p_revert = valid_p_revert + ma_log
forecast_revert = forecast_revert + ma_log
forecast_p_revert = forecast_p_revert + ma_log

# revert log
valid_revert = np.exp(valid_revert)
valid_p_revert = np.exp(valid_p_revert)
forecast_revert = np.exp(forecast_revert)
forecast_p_revert = np.exp(forecast_p_revert)

# remove null data
valid_revert.dropna(inplace=True)
valid_p_revert.dropna(inplace=True)
forecast_revert.dropna(inplace=True)
forecast_p_revert.dropna(inplace=True)

# check reversals
valid_revert.head()

12.2. Charts

A simple way to evaluate the models is visually, through graphs. Below, I plot the Naive, Moving Average, and Holt models together with the real data so we can compare them.

# configure chart
fig, ax = plt.subplots(figsize=(10,5))
ax.set_title('Predictions x Validation')
ax.plot(train_ts, c='#bfbfbf')
ax.plot(valid_ts, c='#91aac9')

# Naive approach
ax.plot(y_hat_ts['naive'], c='#F422D1', linewidth=1.8)

# moving average 3
ax.plot(y_hat_ts['mm3'], c='#004a8f', linewidth=1.8)

# holt's
ax.plot(y_hat_ts['holt'], c='#A826FF', linewidth=1.8)

# legends
ax.legend(['Train', 'Valid', 'Naive', 'Moving Average 3', 'Holt'])

plt.tight_layout();

We can see above how the moving average is slightly more centered relative to the data, and how the Holt model, with its slope, is correctly positioned in the direction of the sales trend.

In a second graph, I show the forecasts of the ARIMA, Prophet, and PyCaret models (along with the actual data). Since these models predict point by point, they are more complex, and plotting them alongside the previous ones would create a lot of visual clutter. This way, we get a graph that is easier to read and that compares similar models.

# configure chart
fig, ax = plt.subplots(figsize=(10,5))
ax.set_title('Predictions x Validation')

ax.plot(valid_revert, c='#bfbfbf')

# ARIMA
ax.plot(forecast_revert.y, c='#F422D1', linewidth=1.5)

# Prophet
ax.plot(forecast_p_revert.y, c='#A826FF', linewidth=1.5)

# PyCaret
ax.plot(forecast_pycaret.y, c='#FB8B24', linewidth=1.5)

# legends
ax.legend(['Validation', 'ARIMA', 'Prophet', 'PyCaret'], loc='lower left')

# set x-axis font size
ax.tick_params(axis='x', labelsize=6)

plt.tight_layout();

Looking at the graph above, we can see that the models are initially very similar, until Prophet diverges around 2020-11-01. PyCaret does not seem to follow the trend as closely as ARIMA and Prophet do. Visually, then, the Prophet model appears to fit the actual data best, but let's confirm this with the metrics in the next step.

12.3. Assessment Metrics

To evaluate the models created earlier and identify which one performs best, two metrics will be used: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE).

MAE is the mean of the absolute differences between the predictions and the actual values of the series; therefore, the lower its value, the better the model. Its calculation is given by:
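
In standard notation, with $y_i$ the actual values, $\hat{y}_i$ the predictions, and $n$ the number of points:

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\, y_i - \hat{y}_i \,\right|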

MAPE, on the other hand, shows how much the predictions differ from the actual values in percentage terms, i.e., it is the percentage counterpart of MAE:
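
Using the same notation (and expressed as a fraction, the way the code below reports it):

\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\, \frac{y_i - \hat{y}_i}{y_i} \,\right|

In the code below, mae and mape are the evaluation functions imported earlier; they are assumed here to correspond to scikit-learn's mean_absolute_error and mean_absolute_percentage_error.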

# evaluate Naive approach
naive_mae = mae(y_hat_ts.y, y_hat_ts.naive)
naive_mape = mape(y_hat_ts.y, y_hat_ts.naive)

# evaluate moving average 3
mm3_mae = mae(y_hat_ts.y, y_hat_ts.mm3)
mm3_mape = mape(y_hat_ts.y, y_hat_ts.mm3)

# evaluate holt's
holt_mae = mae(y_hat_ts.y, y_hat_ts.holt)
holt_mape = mape(y_hat_ts.y, y_hat_ts.holt)

# evaluate arima
arima_mae = mae(valid_revert, forecast_revert)
arima_mape = mape(valid_revert, forecast_revert)

# evaluate prophet
prophet_mae = mae(valid_p_revert, forecast_p_revert)
prophet_mape = mape(valid_p_revert, forecast_p_revert)

# evaluate pycaret
pycaret_mae = mae(valid_ts, forecast_pycaret)
pycaret_mape = mape(valid_ts, forecast_pycaret)

print(f'{"Assessment Metrics":^40}')
print('-'*40)
print('\t\t MAE \t\t MAPE')
print('Naive Approach: {:.2f} \t {:.4f}\n'
      'Moving Average 3:{:.2f} \t {:.4f}\n'
      'Holt: {:.2f} \t {:.4f}\n'
      'ARIMA: {:.2f} \t {:.4f}\n'
      'Prophet: {:.2f} \t {:.4f}\n'
      'PyCaret: {:.2f} \t {:.4f}\n'.format(naive_mae, naive_mape,
                                           mm3_mae, mm3_mape,
                                           holt_mae, holt_mape,
                                           arima_mae, arima_mape,
                                           prophet_mae, prophet_mape,
                                           pycaret_mae, pycaret_mape))
'''
           Assessment Metrics
----------------------------------------
                    MAE        MAPE
Naive Approach:     1631.18    0.0357
Moving Average 3:   1428.87    0.0309
Holt:               1370.60    0.0290
ARIMA:              1089.11    0.0233
Prophet:            1346.82    0.0287
PyCaret:            1756.81    0.0383
'''

We can see that the best model was generated by the ARIMA algorithm, with an error of only 2.33%, although Prophet was not far behind at 2.87%. It is worth noting that Prophet runs much faster than ARIMA, which can be an important consideration depending on the use case.

Also noteworthy is the result obtained by Holt, 2.90%. Since it is a simpler algorithm and came very close to Prophet, it could be an excellent alternative depending on the practical application.

13. Conclusion

The central objective of this study was to develop an algorithm capable of predicting the demand for wines in a specialized store for this product.

By using the Pandas Profiling library, it was possible to obtain an overview of the dataset in just a few seconds, generating a comprehensive report that aided in exploratory analysis.

After proper data preprocessing, including feature engineering that broke down temporal information into 7 additional variables, as well as the addition of 2 variables related to company revenue, various analyses were conducted to understand the business. Then, we proceeded to develop predictive algorithms for demand.

Since we were dealing with a non-stationary time series, according to the Augmented Dickey-Fuller (ADF) test, to achieve better results with some algorithms, we transformed the series into a stationary one using techniques such as applying a logarithm to reduce the magnitude of values, subtracting a 30-period moving average, and differencing the data.

The forecast horizon was set to 120 days, and the following methods were used: Naive approach, Moving Average, Holt's Linear Trend Model, ARIMA, Prophet, and PyCaret. In the case of PyCaret, 27 preliminary models were generated, from which the best one was selected for hyperparameter tuning and forecasting.

As a result, according to MAPE, the model generated by ARIMA performed the best, with an error rate of only 2.33%, followed by Prophet with 2.87% and Holt with 2.90%.

When applying the model, it’s important to consider the specific use case, as some algorithms run more efficiently than others. The choice should take into account the trade-off between accuracy and computational resources or runtime.

Get to know more about this study

This study is available on Google Colab and on GitHub. Just click on the links below to be redirected.

[LoffredoDS] Demand forecasting with Time Series.ipynb
raffaloffredo/demand_forecasting_with_time_series

Raffaela Loffredo

💡 Transforming data into impactful solutions | Co-founder @BellumGalaxy | Data Scientist | Blockchain Data Analyst