Demystifying Pharma Demand Forecasting & Supply Chain Using Snowpark ML & Snowflake

With Kala Govindarajan

Figure 1: A high-level overview of the demand forecasting & supply chain model

In the rapidly evolving world of pharmaceutical manufacturing, effective demand forecasting plays a pivotal role in ensuring the seamless distribution of drugs to meet patients’ needs and mitigate the discomfort arising from drug shortages. Demand for a drug, unlike demand for other retail products, is unique in that it can be influenced by several factors, including:

  1. Purchasing power: whether a patient prefers the branded or the generic version of a drug for the same ailment
  2. Drug access: whether a drug is over the counter (OTC) or prescription-only, in which case it requires specialist dispensaries
  3. Seasonal fluctuations: certain classes of drugs are required during certain seasons (e.g., in the US, allergy products see increased consumption during the spring pollen season, and flu vaccines see higher demand in the transition months into winter)
  4. Atmospheric conditions: humidity and the air quality index can trigger asthmatic episodes in patients suffering from pulmonary disorders and drive demand for the corresponding drugs
  5. Pandemic/epidemic impacts: a medication can have side effects that create demand for a concomitant medication, causing demand to spike during a larger event such as an epidemic. For example, the unforeseen impact of COVID-19 caused not only supply disruptions but also confounding side effects: much of the vaccinated population experienced pain or fever, triggering a spike in demand for OTC pain relievers. Sudden outbreaks led to soaring demand, while lockdowns and vaccine rollouts disrupted supply chains
  6. Global disruptions: events across the globe, such as forest fires or volcanic eruptions, can severely degrade air quality in distant places, increasing the uptake of certain medications for people with pulmonary disorders in addition to causing supply and distribution challenges
  7. Vacation season: people tend to travel far more during the holiday season, increasing the chances of infection and flu spread
  8. Social media sentiment: as with any other product, social media can influence consumer buying patterns, especially in the generic/OTC market, where a certain “branded” version of a drug may be deemed more effective than the “generic” version even when both have the same ingredients
  9. Supply chain complexities: pharmaceutical manufacturing also involves sourcing ingredients from different suppliers, necessitating seamless coordination between demand forecasting and supply chain management to ensure timely production. In the context of pharmaceuticals, an active pharmaceutical ingredient (API) is the core molecule responsible for the desired treatment effect, while excipients are the components needed to formulate the drug and deliver it effectively inside the body. For example, the table below highlights the excipients in a popular over-the-counter antihistamine (cetirizine) commonly used to treat allergy symptoms. As can be observed, there are more than five ingredients, all of which must be sourced from different suppliers to formulate a single cetirizine tablet

Given the complexity of both sourcing supply and the various factors affecting demand, ML-based forecasting becomes critical to enable companies to stock and produce the right quantity of drug products. This is even more true for modalities with a short shelf life that cannot carry a high inventory backlog (e.g., vaccines). ML models with the right feature engineering and parameter tuning can provide accurate predictions in a time-efficient manner, a feat that can be challenging for traditional statistical methods.

An efficient model in this respect allows pharmaceutical companies to plan their drug supply accurately and respond promptly to patients’ actual demand, which is subject to seasonal and external-event-driven variations.

In this blog we will demonstrate how to build such an end-to-end demand forecasting model natively in Snowflake, leveraging key Snowflake capabilities such as:

  1. Data capture with Snowflake Marketplace
  2. Model training with Snowpark ML
  3. Model inferencing for real-time predictions with Snowpark ML UDFs
  4. End-user application with Streamlit in Snowflake

The following table illustrates the end-to-end capabilities of Snowflake that make this pipeline self-sufficient.

Why is this new and what does good look like?

Forecasting based on machine learning is not new; however, traditional forecasting models often fail to accurately predict high-sales-volume and high-volatility scenarios. They typically use historical sales data to generate future forecasts by choosing the best model trained on past data and buying trends. This can lead to underestimating future demand, especially in an industry like pharmaceuticals that, as explained earlier, depends on many external factors (such as seasonal allergies and holiday seasons). A forecasting model requires constant retraining on fresh market signals to remain relevant in the long term.

To demonstrate this, we’ll explore a hypothetical scenario involving a pharmaceutical manufacturer; let’s call it Company A. Imagine that Company A aims to stock a branded allergy/asthma medication across various retail outlets in North America, including pharmacies and other over-the-counter establishments. Using some of the newest offerings from Snowflake, such as Snowpark ML, we will show how demand forecasting efficiently handles seasonal demand fluctuations to ensure on-time delivery to stores while sourcing raw materials in time by anticipating supply chain disruptions.

The entire pipeline, from sourcing the data to model training, model inferencing, and visualization, will run on Snowflake, allowing you to take advantage of all the Data Cloud capabilities and extend them to machine learning analytics.

This empowers you to develop and deploy a business-user-facing application quickly, reducing development time from weeks to days, eliminating ETL where possible, and removing unnecessary data hops.

A brief overview of Snowpark and Snowpark ML

Snowpark, as you may be aware, is a developer framework that offers native SQL, Python, Java, and Scala support for fast and collaborative development. Snowpark ML, in preview at the time of this writing, is a companion to Snowpark Python built specifically for machine learning in Snowflake. The modeling package (snowflake.ml.modeling) provides APIs for data preprocessing, feature engineering, and model training. The package also includes a preprocessing module that takes advantage of the compute resources provided by Snowpark-optimized warehouses to deliver highly scalable data transformations. These APIs are based on familiar ML libraries, including scikit-learn, xgboost, and lightgbm.
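To make this concrete, here is a minimal sketch of what working with the Snowpark ML modeling package can look like. The connection values, the DEMAND_TRAINING_DATA table, and the column names are hypothetical placeholders, not the exact objects used for Company A:

```python
# A minimal sketch of the Snowpark ML modeling API. The connection values,
# table name, and column names are hypothetical placeholders.
from snowflake.snowpark import Session
from snowflake.ml.modeling.preprocessing import StandardScaler

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

df = session.table("DEMAND_TRAINING_DATA")  # hypothetical training table

# Scale numeric features in-database; no data leaves Snowflake.
scaler = StandardScaler(
    input_cols=["TEMPERATURE", "HUMIDITY", "ALLERGY_INDEX"],
    output_cols=["TEMPERATURE_SCALED", "HUMIDITY_SCALED", "ALLERGY_INDEX_SCALED"],
)
scaled_df = scaler.fit(df).transform(df)
```

Note that fit() and transform() push the work down to the warehouse, which is the key difference from running the equivalent scikit-learn code client-side.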

For this demonstration of demand forecasting, we will take advantage of Snowpark ML working alongside Snowpark Python. Additionally, the model will not be trained purely on a time series: we will also leverage product metadata and ailment-condition-related features, so that the forecast is driven by the targeted signals for a single model.

Demand Forecasting Model Building and Training

To build the demand forecasting model, we will follow these steps:

  • Data Exploration
  • Perform feature engineering to accommodate the complexity of variables involved in demand
  • Train a model using classic XGBoost wrapper with Snowpark ML
  • Deploy the model in Snowflake as a UDF
  • Build a native Streamlit application to enable end-user interaction

We will now look at each step in detail.

Step 1: Prepping your data

Machine learning models are effective only when the data used to train them is of good quality and in the right shape. For demand forecasting, we need to pre-process the data: impute attributes containing nulls and remove unwanted information through a process of cleansing. Cleansing involves removing outliers and duplicate entries, replacing inaccurate or irrelevant data, and applying additional imputation techniques to correct missing values. Once this is done, we can proceed to feature engineering. For Company A, which specializes in the manufacture of allergy medications, we have leveraged the data sources listed in Table 1 for model training.
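As a rough illustration of these cleansing steps, the sketch below uses Snowpark DataFrame operations on a hypothetical ORDER_DETAILS table, reusing the session from the earlier snippet; the column names and the outlier threshold are assumptions rather than the exact pipeline used here:

```python
# A sketch of the cleansing steps described above, reusing the Snowpark
# `session` created earlier. Table/column names and thresholds are assumptions.
import snowflake.snowpark.functions as F

orders = session.table("ORDER_DETAILS")

# Remove exact duplicates and rows with an impossible (negative) sales value.
orders = orders.drop_duplicates().filter(F.col("PRODUCT_SALES") >= 0)

# Impute missing humidity readings with the column mean.
mean_humidity = orders.select(F.avg("HUMIDITY")).collect()[0][0]
orders = orders.fillna({"HUMIDITY": mean_humidity})

# Trim outliers above the 99th percentile of sales.
p99 = orders.select(F.approx_percentile("PRODUCT_SALES", 0.99)).collect()[0][0]
orders = orders.filter(F.col("PRODUCT_SALES") <= p99)
```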

Do note that there are five external data sets used for this model and two internal data sets that vary for each customer. You can see the complete data model leveraged for our machine learning in Figure 2.

The power of Snowflake Marketplace is the ability to bring such a rich ecosystem of diverse data sources into your model without having to set up a separate ETL pipeline.
For more about Snowflake Marketplace, refer to this link.

Figure 2: Conceptual schema for the Demand forecasting model

Step 2: Performing Data Exploration

Once the data has been prepped, we proceed with data transformation and exploratory data analysis to understand the underlying statistics and trends. To do this, we read the order details into a Snowpark DataFrame and perform a quick exploratory analysis. One key observation we can already infer from the chart below is the buying pattern during allergy season, which in the US typically occurs around March and April as well as from August to October. This information helps inform our feature engineering strategies. The analysis and plotting below were performed in the Snowsight UI leveraging native Python libraries.

Figure 3: Plot of orders fulfilled per month versus sales
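A minimal sketch of this exploratory step might look like the following, again using the hypothetical ORDER_DETAILS table; the aggregation runs inside Snowflake, and only the small monthly summary is pulled into pandas for plotting:

```python
# A sketch of the exploratory analysis: aggregate monthly sales in Snowflake,
# then pull only the small aggregate into pandas for plotting.
import matplotlib.pyplot as plt
import snowflake.snowpark.functions as F

monthly = (
    session.table("ORDER_DETAILS")
    .with_column("ORDER_MONTH", F.date_trunc("month", F.col("ORDER_DATE")))
    .group_by("ORDER_MONTH")
    .agg(F.sum("PRODUCT_SALES").alias("TOTAL_SALES"))
    .sort("ORDER_MONTH")
)

pdf = monthly.to_pandas()  # only the aggregated rows leave the warehouse
pdf.plot(x="ORDER_MONTH", y="TOTAL_SALES", kind="line", legend=False)
plt.title("Orders fulfilled per month")
plt.show()
```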

Step 3: Performing Feature Engineering

Feature engineering involves analyzing the available features and selecting the most relevant ones to build a model. This can be done through various techniques such as feature creation, feature transformation, feature extraction, and feature selection. For Company A, the product dataset, which includes functional metadata such as whether the product is generic or branded, the drug family, and pricing details, is an essential input to the demand forecasting model.

Since it is necessary to provide the model with more than one feature, as outlined in the introduction, we combine product metadata with other inputs, including ailment-condition-related features, holiday calendars, weather (temperature and humidity), social media sentiment (product feedback from patients on effectiveness), population ailment by zip code (the general population trend for allergy-related illness in a given zip code), and the allergy index, which considers environmental allergy indicators such as weeds, grasses, and pollen (refer to Table 3).

For example, assigning a higher allergy index value indicates that more people with allergies and asthma may be affected by weather factors, such as changes in temperature or allergens in the air. By including this factor, the model can provide a more accurate demand forecast. Once this is done, the features, along with the target variable PRODUCT_SALES, are saved to a Snowflake table, as sketched below.
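The sketch below shows one way these features could be assembled, building on the cleansed orders DataFrame from the Step 1 snippet; the PRODUCT_METADATA, WEATHER_BY_ZIP, and ALLERGY_INDEX_BY_ZIP tables and their join keys are hypothetical stand-ins for the sources in Table 1:

```python
# A sketch of assembling the feature table: join product metadata and external
# signals onto the cleansed `orders` DataFrame from Step 1, then persist the
# result. All table names and join keys are hypothetical.
import snowflake.snowpark.functions as F

products = session.table("PRODUCT_METADATA")     # brand/generic, drug family, price
weather = session.table("WEATHER_BY_ZIP")        # temperature, humidity
allergy = session.table("ALLERGY_INDEX_BY_ZIP")  # weed/grass/pollen indicators

features = (
    orders
    .join(products, on="PRODUCT_ID")
    .join(weather, on=["ZIP_CODE", "ORDER_DATE"])
    .join(allergy, on=["ZIP_CODE", "ORDER_DATE"])
    .with_column("ORDER_MONTH", F.month(F.col("ORDER_DATE")))
)

# Persist the features together with the target variable for training.
features.write.save_as_table("DEMAND_FEATURES", mode="overwrite")
```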

While there are various methods for feature selection, a heatmap helps quickly visualize the strength and direction of correlations between variables in the data. For Company A, the hypothesis about how the months and allergy indicators are correlated can be verified visually in a plot like the one in Figure 4. As with the data exploration, this plot was generated natively in Snowflake.

Figure 4: Heatmap of feature correlations
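For reference, a plot like Figure 4 can be produced with Snowpark ML’s in-database correlation helper plus a standard seaborn heatmap; the sketch below assumes the hypothetical DEMAND_FEATURES table from the previous step:

```python
# A sketch of the correlation heatmap. The pairwise correlations are computed
# in-database by Snowpark ML; only the small matrix comes back for plotting.
import matplotlib.pyplot as plt
import seaborn as sns
from snowflake.ml.modeling.metrics import correlation

features = session.table("DEMAND_FEATURES")  # hypothetical feature table
corr_matrix = correlation(df=features)       # pandas DataFrame of correlations

sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Feature correlations")
plt.show()
```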

Step 4: Building a classic XGBoost Regression model for Model Training using Snowpark ML

Snowpark ML Modeling provides a set of scikit-learn- and xgboost-compatible estimators and transformers that can be used to build and train machine learning models within Snowflake. The main advantage of using snowflake-ml is that it enables users to perform machine learning tasks directly on their Snowflake data without having to move the data out of Snowflake and without having to define and deploy stored procedures that package scikit-learn or xgboost code.

Hyperparameter tuning is a critical step in the model-building process, and for this, GridSearchCV was adopted. GridSearchCV, short for Grid Search Cross-Validation, is a systematic approach to hyperparameter tuning: it creates a grid of candidate hyperparameter values and exhaustively searches all combinations using cross-validation to find the one that yields the best model performance. The Snowpark ML module snowflake.ml.modeling.model_selection provides native support for GridSearchCV. Domain knowledge and a balance of model complexity help tune the appropriate hyperparameters and achieve an effective model.
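The sketch below shows what this training and tuning step could look like with Snowpark ML’s XGBoost wrapper and GridSearchCV; the feature columns and parameter grid are illustrative assumptions rather than the exact values used for Company A:

```python
# A sketch of training with Snowpark ML's XGBoost wrapper plus GridSearchCV.
# The feature columns and parameter grid are illustrative assumptions.
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.modeling.model_selection import GridSearchCV

FEATURE_COLS = ["ORDER_MONTH", "TEMPERATURE", "HUMIDITY",
                "ALLERGY_INDEX", "IS_BRAND", "SENTIMENT_SCORE"]
LABEL_COL = "PRODUCT_SALES"

train_df = session.table("DEMAND_FEATURES")

grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid={
        "n_estimators": [100, 300, 500],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.05, 0.1, 0.3],
    },
    input_cols=FEATURE_COLS,
    label_cols=[LABEL_COL],
    output_cols=["PREDICTED_SALES"],
)

# fit() runs the cross-validated search on warehouse compute.
grid_search.fit(train_df)

# Extract the underlying fitted sklearn object to inspect the best estimator.
best_model = grid_search.to_sklearn().best_estimator_
```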

Finally, when it comes to evaluation, it is important to measure the gap between predicted values and actual outcomes. Error metrics are key to quantifying this gap and provide valuable insights into the model’s strengths and areas for improvement. The Mean Squared Error (MSE) calculates the average of the squared differences between predicted and actual values. It emphasizes larger errors due to squaring, making it sensitive to outliers. It is a popular choice for regression tasks and is the metric used in our model evaluation.
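Continuing the sketch above, the evaluation step could be expressed with Snowpark ML’s in-database metrics; the held-out table name is a hypothetical placeholder:

```python
# A sketch of the evaluation step using Snowpark ML's in-database metrics,
# continuing from the grid search above. The held-out table is hypothetical.
from snowflake.ml.modeling.metrics import mean_squared_error

test_df = session.table("DEMAND_FEATURES_TEST")
predictions = grid_search.predict(test_df)  # appends the PREDICTED_SALES column

mse = mean_squared_error(
    df=predictions,
    y_true_col_names="PRODUCT_SALES",
    y_pred_col_names="PREDICTED_SALES",
)
print(f"MSE: {mse:.2f}")
```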

Step 5: Setting up the model for inference using Snowpark Python UDFs

To enable ongoing inference, a vectorized user-defined function (UDF) is created. Vectorized UDFs are ideal for offline inference in batch mode, so we created a Snowpark Python UDF and included the trained model as a dependency. Once the UDF is registered, obtaining new predictions is relatively straightforward.
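A sketch of such a registration is shown below. It assumes the trained model has been serialized with joblib and uploaded to a hypothetical @ML_MODELS stage, and that the model takes six numeric features; the names are placeholders:

```python
# A sketch of registering a vectorized Snowpark Python UDF for batch inference.
# The stage, model file name, and feature count are hypothetical placeholders.
import sys
import cachetools
import pandas as pd
from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import FloatType, PandasDataFrameType, PandasSeriesType

@cachetools.cached(cache={})
def load_model(filename: str):
    """Load the serialized model once per process from the UDF import directory."""
    import joblib
    import_dir = sys._xoptions.get("snowflake_import_directory")
    return joblib.load(import_dir + filename)

@pandas_udf(
    name="FORECAST_DEMAND",
    is_permanent=True,
    stage_location="@ML_MODELS",                 # hypothetical stage
    imports=["@ML_MODELS/demand_model.joblib"],  # trained model artifact
    packages=["pandas", "xgboost", "joblib", "cachetools"],
    input_types=[PandasDataFrameType([FloatType()] * 6)],
    return_type=PandasSeriesType(FloatType()),
    max_batch_size=1000,
    replace=True,
    session=session,
)
def forecast_demand(features: pd.DataFrame) -> pd.Series:
    # Each call receives a batch of rows as a pandas DataFrame.
    model = load_model("demand_model.joblib")
    return pd.Series(model.predict(features))
```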

This UDF can also be directly invoked from the Streamlit App which is what Company A has done in this scenario.

Step 6: Building a demand forecasting app with Streamlit in Snowflake (SiS)

Streamlit in Snowflake is a great way for non-technical people to quickly understand insights gained from complex analytics. For this use case, a simple Streamlit app that shows demand outcomes per site puts the insight engine in the hands of the decision makers who need to act on the outcome.

For Company A, the experience begins with the history of transactions for the past quarters, as seen in Figure 5.

Figure 5: Snapshot of the Streamlit UI

Clicking the “Forecast Demand for Next Quarter” button triggers the vectorized Python UDF, which uses the demand forecasting model to carry out inference and generate predictions, as seen in Figure 6.

Figure 6: Demand forecasted for upcoming quarter
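For illustration, a minimal Streamlit-in-Snowflake app wired to that button might look like the sketch below; it assumes the FORECAST_DEMAND UDF and the hypothetical tables from the earlier snippets:

```python
# A minimal sketch of the Streamlit-in-Snowflake app. It assumes the
# FORECAST_DEMAND UDF and the hypothetical tables from the earlier snippets.
import streamlit as st
from snowflake.snowpark.context import get_active_session

session = get_active_session()  # SiS provides the session automatically

st.title("Demand Forecasting for Company A")

# Show the transaction history for past quarters (Figure 5).
history = session.table("ORDER_DETAILS").to_pandas()
st.line_chart(history, x="ORDER_DATE", y="PRODUCT_SALES")

# Trigger the vectorized UDF and chart the predictions (Figure 6).
if st.button("Forecast Demand for Next Quarter"):
    forecast = session.sql(
        """
        SELECT SITE_ID,
               FORECAST_DEMAND(ORDER_MONTH, TEMPERATURE, HUMIDITY,
                               ALLERGY_INDEX, IS_BRAND, SENTIMENT_SCORE)
                 AS FORECASTED_SALES
        FROM DEMAND_FEATURES_NEXT_QUARTER
        """
    ).to_pandas()
    st.bar_chart(forecast, x="SITE_ID", y="FORECASTED_SALES")
```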

If a disruption hits the supply chain network, a supplier may also be impacted. To ensure such suppliers are identified during the forecasting process, the design flags any potential concerns. This is important because it allows pharmaceutical companies to proactively address issues and keep the supply chain robust and resilient.

This is where the Resilinc data set is of vital importance: it maps a supplier’s location to external event alerts for that location, predicting any related delays in sourcing materials.

In this manner, Company A leveraged a prebuilt model within its ecosystem to visualize risks to demand fulfillment caused by sourcing impacts from external suppliers, as can be seen in Figure 7. This demonstrates another strength of Snowflake: stringing together different ML functions to provide an end-to-end demand and supply chain application, and it is another demonstration of the value of a vectorized UDF.

Figure 7: Forecast Demand impacted by Supply chain events

By leveraging advanced machine learning algorithms and data-driven approaches, companies are able to identify potential risks early on and take appropriate action to mitigate their impact. This helps them to maintain optimal inventory levels, reduce costs, and ensure that patient demands are effectively responded to.

Conclusion

In conclusion, demand forecasting is a game-changer in the complex world of pharmaceutical supply chain management. It empowers companies to respond effectively to patient demand, mitigate drug shortages, and maintain optimal inventory levels for drugs with short shelf lives. As the pharmaceutical landscape continues to evolve, embracing advanced technologies and data-driven approaches will be crucial for creating a resilient and efficient supply chain that can withstand whatever challenges come its way. With this in mind, Snowflake offers a single, centralized data foundation and ML ecosystem in which you can rely on both internal data and third-party data from its Marketplace to build accurate forecasting and supply chain models that optimize your demand planning.

Last but not least, we implemented an end-to-end ML pipeline entirely within Snowflake.

From a technical standpoint, we leveraged:

  • Snowpark ML for model training and model ops
  • Multi-node training in a Snowflake warehouse by carrying out distributed hyperparameter tuning using the Snowpark ML Modeling API
  • The Modeling API to avoid the manual process of wrapping scikit-learn libraries inside UDFs and stored procedures
  • Streamlit in Snowflake for creating a data app for visualization and analysis

We will follow this blog with another in the series that breaks down the code snippets leveraged for each of the tasks starting from data preparation all the way to model inferencing.

Please also visit the Supply Chain blog for details on how the supply chain optimization is performed. You can also see this entire demand forecasting flow in action by visiting the webinar. If interested, please do not hesitate to contact us for a more detailed discussion.
