Capacity Modeling Using Python Step 1: Demand Forecasting

John Kang
9 min read · Jul 9, 2022


This article is inspired by a previous article I wrote about the power of using Dash and Python to understand an organization’s capacity by building a capacity model (Power of Capacity Planning with Dash Using Python). I believe that an organization can achieve many of the benefits of a ‘big-box’ capacity/planning solution (the ability to model complicated relationships between demand and supply, ease of use, and quality of insights provided by the model) without the high cost of licensing/customization and the long implementation lead time. An organization using Python in combination with a few well-maintained libraries like Pandas and Dash can design and program a workable application quickly (in a few months), thanks to the ease of use of Python and the simple APIs these libraries expose to developers. More than likely, this first iteration of the application will generate enough business value that senior leaders in the organization will provide additional resources to expand the breadth and depth of the model and accelerate its impact. One of the key inputs to a capacity model is a demand forecast, which this article will cover in depth.

Demand forecasting is one of the key inputs (alongside supply planning) to a capacity model. In layman’s terms, demand forecasting boils down to an organization predicting future customer demand over a specific time period. Once an organization has a prediction, it can be compared against the supply plan to identify future time periods where demand and supply diverge. To make this article relatable, I am going to use a fictional company called Acme Bike, which manufactures bicycles in the USA (with data spanning 2021–2030). To understand and model future demand we will follow the steps below:

  1. Clean demand data
  2. Perform exploratory data analysis
  3. Generate forecasting model(s)
  4. Select one forecasting model to predict demand

1. Clean demand data

The first step in this process is actually obtaining the data. In most organizations this consists of pulling data from a database. In this hypothetical situation we already have a pre-built SQL query that was used to pull manufacturing order and defect history. Instead of customizing it (which we probably should, to limit the data we pull in), we are going to use the pre-built query.

A sample of SQL code we would need to run to obtain demand data is below for a fictional database table called “orders_database” (for all dates we have in our repository of orders):

SELECT "Inspect ID", "Style Code", "Order Number", "Line Number", "Quantity", "User Code", "Defect Found", "Created Date", "Maint Date"
FROM orders_database
ORDER BY "Created Date"
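In practice you would run this query from Python rather than from a SQL console. Below is a minimal sketch using the standard-library sqlite3 module; the in-memory connection, the sample rows, and the table contents are all stand-ins, so swap in your organization’s actual database driver (pyodbc, psycopg2, etc.) and credentials:

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for the real one
conn = sqlite3.connect(":memory:")

# Build a tiny stand-in "orders_database" table for demonstration
conn.executescript("""
CREATE TABLE orders_database (
    "Inspect ID" INTEGER, "Style Code" TEXT, "Order Number" INTEGER,
    "Line Number" TEXT, "Quantity" INTEGER, "User Code" TEXT,
    "Defect Found" TEXT, "Created Date" INTEGER, "Maint Date" INTEGER
);
INSERT INTO orders_database VALUES
    (1, 'A1', 100, 'Cool Kids Bike', 1, 'U1', 'N', 20210105, 20210106),
    (2, 'A2', 101, 'Wheel Chair',    2, 'U2', 'N', 20210103, 20210104);
""")

query = '''
SELECT "Inspect ID", "Style Code", "Order Number", "Line Number",
       "Quantity", "User Code", "Defect Found", "Created Date", "Maint Date"
FROM orders_database
ORDER BY "Created Date"
'''

# Pull the query results straight into a DataFrame
orders = pd.read_sql(query, conn)
print(orders.shape)  # → (2, 9)
```

From here the DataFrame can be written to a .csv with `orders.to_csv(...)`, which is the file format we work with for the rest of the article.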

The data we pulled will be saved in a .csv file. If you want to follow along, the data is on GitHub at: https://github.com/johnkangw/medium_article_data/blob/3e00cf0c3829f29e05b4330330fb156cb3a86b17/bike_data.csv. This is repurposed sample data for analyzing quality in a manufacturing and assembly operation, provided to me by a colleague who gave permission to use it for this article.

The data we pulled is not great for analysis because there are too many columns, the date columns are not formatted as the datetime data type, and the column names are not very descriptive. To clean the data we will pull it into a Jupyter notebook and clean it using Pandas:

# Import the Pandas library
import pandas as pd

# Read in and save the raw data
bike_data = pd.read_csv(r'C:\Users\bike_data.csv')

# Clean the data using Pandas methods that are chained together
# (wrapping the chain in parentheses lets us keep comments inline)
bike_data_clean = (
    bike_data
    # Drop extra columns that are not needed
    .drop(['Unnamed: 0', 'Inspect ID', 'Style Code', 'Order Number',
           'User Code', 'Defect Found', 'Maint Date'], axis='columns')
    # Rename columns to be more descriptive
    .rename(columns={'Line Number': 'Product', 'Quantity': 'Quantity Sold'})
    # Create a new datetime column from the "Created Date" column
    .assign(created_date=lambda x: pd.to_datetime(x['Created Date'], format='%Y%m%d'))
    # Drop the original "Created Date" column
    .drop(['Created Date'], axis='columns')
)

The above code cleans our data and prepares it for analysis (this style of chaining Pandas methods is explained really well by Matt Harrison in his book Effective Pandas at https://store.metasnake.com/effective-pandas-book). The first method drops unused columns, the second renames a few columns to be more descriptive, the third creates a new column formatted as the datetime data type, and the last drops the original date column.
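To see the cleaning chain end to end, here it is run on a tiny two-row stand-in for the raw file. The values are invented; only the column names match the real data:

```python
import pandas as pd

# Two-row stand-in for the raw CSV, using the same column names
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1], "Inspect ID": [1, 2], "Style Code": ["A1", "A2"],
    "Order Number": [100, 101], "Line Number": ["Cool Kids Bike", "Wheel Chair"],
    "Quantity": [1, 2], "User Code": ["U1", "U2"], "Defect Found": ["N", "N"],
    "Created Date": [20210105, 20210103], "Maint Date": [20210106, 20210104],
})

clean = (
    raw
    # Drop columns not needed for forecasting
    .drop(["Unnamed: 0", "Inspect ID", "Style Code", "Order Number",
           "User Code", "Defect Found", "Maint Date"], axis="columns")
    # Give the remaining columns descriptive names
    .rename(columns={"Line Number": "Product", "Quantity": "Quantity Sold"})
    # Parse the integer dates into a proper datetime column
    .assign(created_date=lambda x: pd.to_datetime(x["Created Date"], format="%Y%m%d"))
    .drop(["Created Date"], axis="columns")
)
print(clean.columns.tolist())  # → ['Product', 'Quantity Sold', 'created_date']
```

The output confirms we end up with exactly the three columns (product, quantity, datetime) that the rest of the analysis needs.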


2. Perform Exploratory Data Analysis

In this second step, we are exploring the data to understand it. This section could be a full series of articles on the topic, but in the interest of brevity and accomplishing the mission of the article (understanding how to obtain a demand forecast) we will explore the data sufficiently but not necessarily completely. First, let’s understand the pattern of orders.

print(bike_data_clean.describe())

       Quantity Sold
count   11837.000000
mean        1.256991
std         0.667287
min         0.000000
25%         1.000000
50%         1.000000
75%         1.000000
max        10.000000

Looking at the numeric column (Quantity Sold), it looks like the bulk of the orders are for a single unit, because the 25th percentile, 50th percentile (median), and 75th percentile are all 1.

Next, let’s understand the orders by product.

print(bike_data_clean['Product'].value_counts())

Cool Kids Bike         6417
Wheel Chair            4033
Boring Kids Bike        521
Adult Road Bike         438
Adult Mountain Bike     428
Name: Product, dtype: int64

It looks like the top two products are the Cool Kids Bike and the Wheel Chair. The remaining products are ordered roughly an order of magnitude less often than those two.

Next, let’s understand how these orders are distributed across time.

# Group by month and plot total quantities sold across all years
bike_data_clean.groupby(pd.Grouper(key='created_date', freq='M'))['Quantity Sold'].sum().plot(figsize=(17, 3))
Distribution of Orders across all years

This is a little difficult to read, so let’s zoom in on 2021 and 2022.

# Zoom in on 2021 and 2022
bike_data_clean.groupby(pd.Grouper(key='created_date', freq='M'))['Quantity Sold'].sum().plot(figsize=(50, 10), xlim=("2021-01-01", "2022-12-31"))
Distribution of Orders across 2021–2022

It looks like the dips in demand are heavily concentrated around the summer months (May/June/July).
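One quick way to confirm the dip we see in the plot is to average quantity sold by calendar month across all years. Here is a sketch on synthetic data with a deliberate summer trough; the real analysis would use bike_data_clean in place of the invented frame:

```python
import pandas as pd
import numpy as np

# Synthetic daily orders for two years with a deliberate May-July trough
dates = pd.date_range("2021-01-01", "2022-12-31", freq="D")
qty = np.where(dates.month.isin([5, 6, 7]), 1, 3)
orders = pd.DataFrame({"created_date": dates, "Quantity Sold": qty})

# Average quantity sold per calendar month, pooled across years
monthly_mean = orders.groupby(orders["created_date"].dt.month)["Quantity Sold"].mean()

# The month with the lowest average lands in the summer trough
print(monthly_mean.idxmin())
```

If the real data shows the same shape (a clear minimum in months 5–7), that is quantitative support for the seasonality assumption we make below.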

Next, let’s dive into understanding each particular product.

# Understand each product
bike_data_clean.groupby([pd.Grouper(key='created_date', freq='M'),
                         pd.Grouper(key='Product')])['Quantity Sold'].sum().unstack().plot(figsize=(17, 3))
Distribution of Orders by Product by Month

It is difficult to read the graph because it is trying to display too many lines at once. Let’s filter for the high volume (Cool Kids Bike and Wheel Chair) and low volume (remaining products) and create separate graphs.

high_volume = ['Cool Kids Bike', 'Wheel Chair']
bike_data_clean[bike_data_clean['Product'].isin(high_volume)].groupby([pd.Grouper(key='created_date', freq='M'),
                                                                       pd.Grouper(key='Product')])['Quantity Sold'].sum().unstack().plot()
Distribution of High Volume Products by Month

It looks like the high volume distributions mimic what we saw in the overall distribution (which makes sense as the overall distribution is driven by these two high volume products).

bike_data_clean[~bike_data_clean['Product'].isin(high_volume)].groupby(
    [pd.Grouper(key='created_date', freq='M'), pd.Grouper(key='Product')])['Quantity Sold'].sum().unstack().plot(figsize=(17, 3))
Distribution of Low Volume Products by Month

The distribution of low volume products is less smooth than that of the high volume products (which is expected given the small counts), but the seasonal pattern is similar.

Taking into account all we have learned from our exploratory data analysis (the bulk of our products follow a similar distribution of orders throughout the year), we have a decision to make. In a ‘normal’ process we would ask the business more questions about the distribution of orders and whether the data represents reality. For the purposes of this article, let’s assume there is strong seasonality and that orders for bikes are really low in May, June, and July. Let’s also assume that orders are lower during these months because the primary customers of these bikes are in the southern USA and it is too hot during these months to ride.


3. Generate different forecasting model(s)

This section consists of using our cleaned-up data to generate demand forecasting models to predict the future. To start, we will use the Python library sktime. Before generating forecasts we need to prepare our data further by removing unused columns, setting the index to the date, and grouping by month.

# Prepare the data for time series forecasting
forecast_ready_2 = bike_data_clean.set_index('created_date').\
    sort_index().\
    groupby([pd.Grouper(freq='1M'), 'Product']).\
    sum().\
    reset_index().\
    set_index('created_date')

For demonstration purposes I will only walk through forecasting one of the products (the Cool Kids Bike). The first step is breaking out just the data for that one product.

# Split data into products
cool_kids_bike_2 = forecast_ready_2.loc[forecast_ready_2['Product'] == 'Cool Kids Bike'].\
    drop(labels=['Product', 'single_digit_year'], axis='columns').\
    assign(created_date=lambda df: df.index.to_period('M')).\
    set_index('created_date')

For simplicity, we will generate a demand forecast using a naive forecast with a 12-period seasonality. With the 'mean' strategy and sp=12 used below, the forecaster predicts each future month as the mean of all historical values observed in that same calendar month (using the 'last' strategy instead would simply repeat the previous year’s value for each month).
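To make the arithmetic concrete, here is a plain-Python sketch of what a seasonal naive forecast with the 'mean' strategy computes: each future period is predicted as the mean of all historical values at the same seasonal position. A toy seasonal period of 3 is used so the numbers are easy to check by hand:

```python
def seasonal_naive_mean(history, sp, horizon):
    """Predict each future step as the mean of past values at the same
    seasonal position (a plain-Python sketch of sktime's NaiveForecaster
    with strategy='mean' and sp=<seasonal period>)."""
    preds = []
    for h in range(1, horizon + 1):
        # Seasonal position of this future step
        pos = (len(history) + h - 1) % sp
        # All historical values at that seasonal position
        same_season = history[pos::sp]
        preds.append(sum(same_season) / len(same_season))
    return preds

# Two "years" of history with seasonal period 3
history = [10, 20, 30, 12, 22, 32]
print(seasonal_naive_mean(history, sp=3, horizon=3))  # → [11.0, 21.0, 31.0]
```

Each prediction is the average of the two historical values in the matching position (e.g. 11.0 is the mean of 10 and 12), which is exactly the behavior we lean on for the monthly bike data with sp=12.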

import numpy as np
from sktime.forecasting.naive import NaiveForecaster

forecasting_horizon = np.arange(1, 25)

forecaster = NaiveForecaster(
    strategy='mean',
    sp=12
)
forecaster.fit(cool_kids_bike_2)
y_pred = forecaster.predict(forecasting_horizon)

We will visualize it using a plot:

y_pred.plot(figsize=(17,3))
Cool Kids Bike prediction 24 months in the future using a naive forecast

Repeat this methodology with other time series forecasting methods such as ARIMA and exponential smoothing.
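As a flavor of what one of those alternatives computes, simple (level-only) exponential smoothing can be written in a few lines. This is a minimal sketch of the idea only; sktime’s and statsmodels’ implementations add trend and seasonality handling on top of it:

```python
def simple_exp_smoothing(series, alpha):
    """Level-only exponential smoothing: each new level is a weighted
    blend of the latest observation and the previous level; the final
    level is the one-step-ahead forecast."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

# Hypothetical recent monthly demand for one product
demand = [100, 110, 105, 120, 115]
print(simple_exp_smoothing(demand, alpha=0.5))  # → 113.75
```

A higher alpha weights recent observations more heavily, which is the key tuning knob when comparing this method against the naive forecast.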

4. Select one forecasting model to predict demand

After repeating this methodology, we would use performance metrics on held-out data to select the ‘best’ model and use it to predict demand. This is a very important step that requires a significant investment of time and effort. For the purposes of this article, I will briefly summarize it.

For instance, we could separate the data into 3 buckets: training (70%), validation (15%), and testing (15%). Since the data spans 10 years (2021–2030), we could use the first 7 years to train as many models as are feasible and helpful for our specific situation. After those models are trained, we would use them to generate predictions against the 15% of data in the validation bucket and compare them to the actual demand. We would choose whichever model performed ‘best’ according to the metrics we select. Lastly, we would take that selected model and generate predictions against the last 15% of the data to understand how well it may perform in the real world. For this application we will use the one model we generated, for brevity’s sake.
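The "compare on the validation bucket" step usually boils down to computing an error metric per model and picking the smallest. Here is a sketch with hypothetical validation-period numbers and two invented candidate models, using mean absolute error (MAE):

```python
def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in units sold."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical validation-period demand and two candidate models' forecasts
actual  = [100, 120, 110, 130]
model_a = [ 90, 125, 105, 140]
model_b = [110, 100, 120, 120]

scores = {"model_a": mae(actual, model_a), "model_b": mae(actual, model_b)}
best = min(scores, key=scores.get)
print(best, scores[best])  # → model_a 7.5
```

MAE is easy to explain to the business because it is in the same units as demand; percentage-based metrics like MAPE are a common alternative when products have very different volumes.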

To start the visualization of the entire application, I’ve written starter code to generate a Dash application. If you would like to see the raw data and/or the code below, see this GitHub repository: https://github.com/johnkangw/medium_article_data

"""Capacity Dash Application"""

from
dash import Dash, html, dcc
import plotly.express as px
import pandas as pd
import numpy as np
from sktime.forecasting.naive import NaiveForecaster

app = Dash(__name__)

# Import over data to analyze
# Read in and save the raw data
bike_data = pd.read_csv("bike_data.csv")
# Clean the data using Pandas methods that are chained together
bike_data_clean = bike_data.\
drop(['Unnamed: 0', 'Inspect ID', 'Style Code',
'Order Number', 'User Code', 'Defect Found', 'Maint Date'], axis='columns').\
rename(columns={'Line Number': 'Product', 'Quantity': 'Quantity Sold'}).\
assign(created_date = lambda x: pd.to_datetime(x['Created Date'], format ='%Y%m%d')).\
drop(['Created Date'], axis='columns')


# Groupby month and plot quantities sold
bike_data_clean.groupby(pd.Grouper(key='created_date', freq='M')).sum()['Quantity Sold'].plot()
clean_data = bike_data.assign(**{'Created Date': lambda x: pd.to_datetime(x['Created Date'], format="%Y%m%d"),
'Maint Date': lambda x: pd.to_datetime(x['Created Date'], format="%Y%m%d")})
# Prepare the data for time series forecasting
forecast_ready_2 = bike_data_clean.set_index('created_date').\
sort_index().\
groupby([pd.Grouper(freq='1M'),'Product']).\
sum().\
reset_index().\
set_index('created_date')
# Split data into products
cool_kids_bike_2 = forecast_ready_2.loc[forecast_ready_2['Product'] == 'Cool Kids Bike'].\
drop(labels=['Product', "single_digit_year"], axis='columns').\
assign(created_date = lambda df: df.index.to_period('M')).\
set_index('created_date')

forecasting_horizon = np.arange(1, 25)
forecaster = NaiveForecaster(
strategy='mean',
sp=12
)
forecaster.fit(cool_kids_bike_2)
y_pred = forecaster.predict(forecasting_horizon)
y_pred.index.name = "Date"
y_pred.rename(columns={y_pred.columns[0]: "Forecast"}, inplace=True)

y_plot = y_pred.reset_index()
y_plot.Date = y_plot.Date.astype('datetime64[ns]')

fig = px.line(y_plot, x="Date", y="Forecast")

app.layout = html.Div(children=[
html.H1(children='Naive Demand Forecast for next year plotted'),

html.Div(children='''
Plot
'''),

dcc.Graph(
id='example-graph',
figure=fig
)
])

if __name__ == '__main__':
app.run_server(debug=True)
Plot Generated from the code above using Dash and Plotly Express

In the next article we will model out our supply so that we can match it up with this demand.
