Why biopharma manufacturing needs to leverage network optimization techniques via data science

Published in

GAMMA — Part of BCG X

Jun 19, 2018

Despite large R&D budgets, the life sciences industry has been slow to adopt innovative technologies such as analytics and data science. Over the past two years, however, the tide has turned. Uptake is accelerating not only in R&D, but also within the four walls of manufacturing. In fact, several major biopharma companies are already applying analytics to their manufacturing data to predict lead times and throughput, with accuracies high enough to surprise even the data scientists who engineered the capabilities in the first place.

The recipe for a successful data science effort in biopharma manufacturing is a team with both technical skills and biopharma domain expertise. The data scientist must first understand the data’s complexity for herself, then be able to communicate that knowledge to others in such a way that they understand (and believe) her predictions and prescriptions.

All of which raises the question: how can data scientists do either of those things with data as complex as that found in biopharma manufacturing?

My answer to that question is influenced by the consulting work I’ve done around geoanalytic systems integration over the past 25 years. Since 2013, I have been focusing that expertise on network optimization — how to move goods or people from one place to another with maximum efficiency.

Network optimization is normally applied at the scale of a city, region, country, or even globally. Decisions about where to open a new store, which products should flow through which distribution centers, or whether to ship by land, sea, or air all optimize a company’s supply chain at the macro level.

Manufacturing sits at the opposite end of that spectrum, often within the four walls of a single building. In my experience, the consensus among geoanalytic professionals has been that applying spatial analytics inside a plant might add a little value, but not as much as regional-scale geoanalysis projects do.

But I have come to believe that consensus was wrong.

The big value of applying network optimization techniques to manufacturing data is that it makes the movement of materials visible and, in the process, easy to reference as they branch and merge through laboratories, machines, and assembly lines. When material movements are easy to reference, they also become easier to predict. And with clear views of past, present, and future material movements, it’s also easier to prescribe more efficient operating decisions.

Just as transportation network planners use highway maps to illustrate where bottlenecks are or how to reduce traffic congestion, biopharma manufacturing leaders can use data visualizations to illustrate where the batches in their drug product recipes are backing up, or data science to predict which steps will be the most volatile and how to smooth them out.

Visualize it

A common first step in biopharma data science is to visualize end-to-end processes and track their metrics. For example, site leads may want to track the volatility of lead times for thousands of batches that all follow the same drug production recipe. The flow of materials in such a recipe could look like this Sankey diagram, where thicker flow lines represent higher volatility and pink lines mark the steps whose lead times vary by more than one month (see Exhibit 1).

Exhibit 1: A Sankey diagram visualizing the volatility of lead times for thousands of different drug batches
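As a rough sketch of the arithmetic behind such a diagram, the snippet below groups batch lead times by recipe step, takes the standard deviation as the volatility measure, and flags steps above a threshold. The step names, the lead times, and the 30-day reading of "one month" are all assumptions made for illustration, not data from a real plant.

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical batch records: (recipe step, lead time in days).
# Step names and values are invented for illustration.
batch_lead_times = [
    ("API -> blend", 12), ("API -> blend", 14), ("API -> blend", 13),
    ("blend -> tablets", 5), ("blend -> tablets", 6), ("blend -> tablets", 5),
    ("tablets -> packaged", 10), ("tablets -> packaged", 95),
    ("tablets -> packaged", 40),
]

# Assumption: "more than one month" is approximated as 30 days.
VOLATILITY_THRESHOLD_DAYS = 30


def lead_time_volatility(records):
    """Return {step: population standard deviation of lead time in days}."""
    by_step = defaultdict(list)
    for step, days in records:
        by_step[step].append(days)
    return {step: pstdev(days) for step, days in by_step.items()}


volatility = lead_time_volatility(batch_lead_times)
flagged = sorted(
    step for step, sd in volatility.items() if sd > VOLATILITY_THRESHOLD_DAYS
)

for step, sd in volatility.items():
    color = "pink" if step in flagged else "gray"
    print(f"{step}: sd = {sd:.1f} days ({color})")
```

In this toy data set only the packaging step is flagged, because its three lead times (10, 95, and 40 days) are spread far more widely than those of the other steps.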

Predict it

Some companies have already gone beyond initial data visualization to also use data science to predict and prescribe. For example, at one biopharma manufacturer’s plant there were more than 100 quality events open at any given time. In the drug substance stage, the active ingredient concentrations might be too low. During drug production, a batch of tablets might be too friable. During packaging, ink on labels might smear.

When the workload necessary to resolve those issues spiked and staffing became stretched, some events remained unresolved long past their closure deadlines. That’s because even though the work to resolve a quality deviation varied from one event to another, it was company policy to give them all the same deadlines. Inevitably, managers would assign too many difficult events to one person and too few to others. The managers didn’t intend to create workload imbalances. They simply lacked a reliable way to estimate the relative difficulty of individual events.

That changed when data scientists built a machine learning model that was able to accurately predict which quality events were likely to have severely delayed closures. The company’s quality organization ran the model every day with a combination of more than 30 different predictive inputs, including which operation and material were involved, whether the material came from a vendor, and the length of the deviation’s response rationale. Results were very promising: for the target drug brand, the model predicted more than 85% of the severe delays weeks before they happened, in some cases the moment the deviation was first identified.

Having prior warning of delays enabled the plant’s quality managers to prioritize and assign deviation resolution tasks more equitably. People with less work on their plates or more experience could be reassigned to the tasks that the machine learning model predicted would be delayed. And downstream, the material supply leads could plan ahead for any delays triggered by the ripple effect of the potentially delayed quality checks. This new knowledge led to a smoother operation overall, with cost savings from reduced overtime, fewer expedited shipments, and higher throughput.
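The plant’s actual assignment rule isn’t described here, but one simple way a delay prediction could feed into workload balancing is a greedy pass: take the riskiest events first and hand each one to whoever currently has the lightest load. In the sketch below, the event IDs, the names, the risk scores, and the "effort = 1 + risk" weighting are all hypothetical.

```python
import heapq


def assign_events(events, current_load):
    """Greedily assign quality events to the least-loaded person.

    events: list of (event_id, predicted_delay_risk), risk between 0 and 1
    current_load: {person: current workload in arbitrary effort units}
    Returns {event_id: person}.
    """
    # Min-heap of (load, person) so the least-loaded person pops first.
    heap = [(load, person) for person, load in current_load.items()]
    heapq.heapify(heap)

    assignment = {}
    # Handle the riskiest events first, while loads are still most uneven.
    for event_id, risk in sorted(events, key=lambda e: e[1], reverse=True):
        load, person = heapq.heappop(heap)
        assignment[event_id] = person
        # Assumption: a riskier event costs more effort (1 + risk).
        heapq.heappush(heap, (load + 1 + risk, person))
    return assignment


# Hypothetical open events with model-predicted delay risk.
open_events = [("QE-101", 0.9), ("QE-102", 0.2), ("QE-103", 0.7), ("QE-104", 0.1)]
team_load = {"Ana": 3.0, "Ben": 1.0, "Chen": 2.0}

assignment = assign_events(open_events, team_load)
for event_id, person in sorted(assignment.items()):
    print(event_id, "->", person)
```

A real rule would also need to account for experience, which the greedy pass ignores; it is only meant to show how a risk score turns into a concrete assignment.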

In other words, the more insights that analytics uncovers, the more biopharma manufacturers recognize that they really do need data science.

Integrate it

What biopharma manufacturers find less clear is how to begin integrating data science into their daily operations. While this is a challenge common to all industries, drug manufacturing has its own special barriers to integration.

In a sales and marketing department, very big but also very simple “point-of-sale” data sets are common. In that context, point-of-sale data means daily cash register (or online) transactions from millions of customers who pay for the company’s products — anything from a cup of coffee to a streaming movie to car tires. While deriving valuable insights from such simple data structures is certainly challenging and requires creative data science, establishing a universal understanding about the data among all stakeholders isn’t hard.

Drug manufacturing data, particularly in biopharmaceuticals, is a completely different story. Its structure is highly complex, varies from one drug brand to another, and is generated in dozens of different systems distributed across the organization, not just from cash registers. I’ll come back to the complex structure, but first let’s consider distributed data.

Drug manufacturing data resides in digital systems that have typically been in place for years, often in enterprise resource planning (ERP) systems such as those from SAP. When company leaders ask for what they believe is a simple report based on that ERP-housed data, their teams may work for weeks to dig up the answer. Those who know why will tell you it’s because the information lives in different legacy systems (sometimes called “data silos”) that take time to integrate when new questions arise that don’t match each silo’s originally intended purpose. This is an old, familiar pain point to anyone with an IT background.

Incidentally, data science does provide new machine learning alternatives to data and systems integration. A good example of off-the-shelf software designed to speed up silo integration is Tamr, which combines automated matching with human guidance to speed up integration by orders of magnitude. But I will cover that in a later blog post.

Aside from examples like Tamr, the more general gift of data science in addressing the data silo pain point is simply to identify big cost savings or new revenue generation opportunities. Nearly all of them can only be realized with data consolidation, so any funding for data science work will necessarily also have to address the silo pain point. Peter Guerra and Kirk Borne say it well in an O’Reilly report, “Ten Signs of Data Science Maturity”:

“Big data isn’t about the volume of data nearly as much as it is about ‘all data’ — stitching diverse data sources together in new and interesting ways that facilitate data science exploration and exploitation of all data sources for powerful predictive and prescriptive analysis.”

Specific to biopharma manufacturing, we need to combine live data feeds such as the material movements data in ERP systems, the deviations and incidents in quality tracking and laboratory investigation management systems, and the machine operating data from manufacturing execution systems. We also need to join this rich set of batch trace data to financial data such as costs and revenue down to the SKU level, and to link actual output to market demand.
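To make the idea of stitching these sources together concrete, here is a minimal sketch that joins three made-up extracts on batch ID and SKU. The field names, batch IDs, and values are assumptions; real ERP, quality, and finance systems will have their own schemas and far messier keys.

```python
# Hypothetical extracts from three separate systems. All field names and
# values are invented for illustration.
erp_movements = {  # ERP: material movements per batch
    "B001": {"sku": "TAB-10MG", "lead_time_days": 21},
    "B002": {"sku": "TAB-20MG", "lead_time_days": 48},
}
quality_events = [  # quality tracking system: deviations
    {"batch_id": "B002", "deviation": "tablets too friable"},
    {"batch_id": "B002", "deviation": "label ink smeared"},
]
sku_financials = {  # finance system: cost per SKU
    "TAB-10MG": {"unit_cost": 0.40},
    "TAB-20MG": {"unit_cost": 0.55},
}


def build_batch_view(movements, deviations, financials):
    """Join the three sources into one record per batch."""
    view = {}
    for batch_id, move in movements.items():
        # Look up cost by SKU; None if the finance system lacks the SKU.
        cost = financials.get(move["sku"], {}).get("unit_cost")
        view[batch_id] = {**move, "unit_cost": cost, "deviations": []}
    for event in deviations:
        record = view.get(event["batch_id"])
        if record is not None:  # skip deviations for unknown batches
            record["deviations"].append(event["deviation"])
    return view


batch_view = build_batch_view(erp_movements, quality_events, sku_financials)
for batch_id, record in batch_view.items():
    print(batch_id, record)
```

Once every batch carries its movements, deviations, and costs in one record, questions that span silos (for example, whether batches with deviations also have longer lead times) become a simple filter rather than a multi-week integration effort.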

If data silo problems are a pain point for a manufacturing company’s leadership, what’s the biggest challenge for the data scientists themselves? In other words, if all the data were already stitched together, could data scientists immediately pull value out of it?

Well…some value, maybe, but another big challenge would remain: data complexity.

Unlike marketing and sales data, operations data is fiendishly complicated. A biopharmaceutical company can have dozens of plants, each responsible for a subset of the entire end-to-end manufacturing process and each manufacturing many different drug brands (each with multiple SKUs, some coming from third-party contract manufacturing organizations). One plant might make the drug substance (e.g., a liquid serum); another, the drug product (e.g., the serum distilled into tablets); and a third, the packaging. Each brand can either require its own set of machines and lab equipment or share some equipment with other drug brands at the same plant. Moreover, some recipes have more than a dozen different series of sub-recipes that run in parallel and are later formulated into a single substance. And recipes change over time.

Modeling this complexity is difficult. But once a model works as desired, the overall job is still not done; it’s then time to scale up to other brands at other sites. Unfortunately, the more sophisticated your data science model becomes for one drug brand and site, the less likely it will easily scale up to all brands, or even to any other brands. Not that it’s impossible — in fact, it’s both possible and fascinating — but the learning curve of domain knowledge is steep. This is where network optimization can help.

Optimize it

Initially, the big value of applying network optimization techniques is as simple as establishing a common agreement around what the manufacturing process actually is and having a reference point that organizes all of that process’s disparate data. But then, at least in every case I’ve seen so far, democratizing the data through accessible visualizations opens up the discussion for predictive and prescriptive analytics similar to the quality deviation example introduced earlier.

Visualizations can start with a Sankey diagram of a drug “recipe” — just the raw materials and how they’re combined to make semi-finished or finished goods. Here’s an example built from fictitious data. (See Exhibit 2).

Exhibit 2: A Sankey diagram of a drug recipe

How do you read these diagrams? Most of the thin vertical rectangles in this diagram are manufacturing materials: raw materials, active ingredients, packaging materials, semi-finished products, and finished goods. The green bars (labeled “site transfer” in the legend) are loading docks, which emphasize not just the recipe, but where the cooking happens. All the vertical bars are joined together with thick curvy lines, colored either gray or pink, to show how materials change as they move through the manufacturing process.

A single material might become a different single material, such as when a powder substance goes through a compressing machine and comes out as tablets: no new materials were added to the powder, but it is now a “new” material with a different format. In most cases, two or more vertical bars merge into a common single downstream bar. This means the upstream materials were mixed together to form a new downstream material.

To represent just the basic recipe, all the links between bars are the same thickness; they only show the connections between materials in the recipe. Some are pink, indicating the flow of active ingredients through the recipe. Most are gray, indicating that they are excipients, raw materials, or packaging materials.
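Underneath a diagram like this sits a simple directed graph: materials are nodes, and each link records which upstream material feeds which downstream one and whether it carries the active ingredient. The sketch below uses an invented seven-link recipe to show how merges and the active-ingredient path fall out of that structure. The material names are hypothetical, and the path tracer assumes each material has at most one active downstream link.

```python
from collections import defaultdict

# Hypothetical recipe: (upstream, downstream, carries_active_ingredient).
recipe_links = [
    ("API", "drug substance", True),
    ("solvent", "drug substance", False),
    ("drug substance", "powder blend", True),
    ("excipient", "powder blend", False),
    ("powder blend", "tablets", True),  # one-to-one: format change only
    ("tablets", "packaged tablets", True),
    ("blister foil", "packaged tablets", False),
]


def inputs_by_material(links):
    """Return {downstream material: [upstream materials]}."""
    inputs = defaultdict(list)
    for upstream, downstream, _ in links:
        inputs[downstream].append(upstream)
    return dict(inputs)


def active_ingredient_path(links, start):
    """Follow the active ("pink") links from start to the finished good.

    Assumes at most one active downstream link per material.
    """
    next_step = {up: down for up, down, active in links if active}
    path = [start]
    # Stop at the finished good, or if a cycle would revisit a material.
    while path[-1] in next_step and next_step[path[-1]] not in path:
        path.append(next_step[path[-1]])
    return path


inputs = inputs_by_material(recipe_links)
# Materials with two or more inputs are the "merge" points in the diagram.
merges = sorted(m for m, ups in inputs.items() if len(ups) >= 2)
path = active_ingredient_path(recipe_links, "API")

print("Merges:", merges)
print("Active ingredient path:", " -> ".join(path))
```

Here "tablets" has a single input, which corresponds to the powder-to-tablet format change, while the other three downstream materials are merges.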

At a glance, we can immediately see that manufacturing this product involves three different steps at three different locations before being sent to market; let’s suppose they represent the drug substance, drug production, and drug packaging steps from left to right. We can see that the third location is where all the packaging happens, as we would expect for the last set of steps. We can also see that the active ingredient is part of the process from the very beginning, starting in the facility on the left of the diagram, and that it flows all the way to the finished good on the right.

On its own, this biopharma recipe shown as a Sankey diagram might be useful as a reference, but otherwise has only marginal utility. It’s neither predictive nor prescriptive, and it doesn’t change very often over time. Fortunately, things get more interesting when the same recipe diagram is reset to represent all of the batches that have ever been manufactured by following that recipe. In this approach, the vertical bars represent the same materials (or loading docks), but the links between them now become thinner or thicker based on a metric, such as lead time volatility.
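Turning a per-link metric into a link thickness is just a rescaling step. The sketch below maps metric values linearly onto a range of widths; the link names, metric values, and the 1-to-20 width range are arbitrary assumptions, and a real charting tool may apply its own scaling.

```python
def link_widths(metric_by_link, min_width=1.0, max_width=20.0):
    """Linearly rescale metric values to link widths.

    The smallest metric gets min_width and the largest gets max_width.
    If all values are equal, every link gets min_width.
    """
    values = list(metric_by_link.values())
    lo, hi = min(values), max(values)
    if hi == lo:
        return {link: min_width for link in metric_by_link}
    span = max_width - min_width
    return {
        link: min_width + (value - lo) / (hi - lo) * span
        for link, value in metric_by_link.items()
    }


# Hypothetical lead time volatility (standard deviation, days) per link.
volatility_days = {
    "substance -> product site": 10.0,
    "blend -> tablets": 2.0,
    "foil -> packaged": 6.0,
}

widths = link_widths(volatility_days)
for link, width in widths.items():
    print(f"{link}: width {width:.1f}")
```

Swapping in a different metric, such as throughput volatility or material loss, reuses the same recipe links and changes only the numbers fed into the rescaling.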

The next three diagrams illustrate the concept using the same common recipe as a starting point, but with different link dimensions to represent which steps have the most volatile lead times (see Exhibit 3), which steps have the most throughput volatility (see Exhibit 4), or which steps have the highest material loss (see Exhibit 5).

Exhibit 3: Sankey diagram illustrating lead time volatility

This first view of batch flow through the recipe, in the diagram above, indicates that three out of the four most volatile active ingredient steps in this drug’s production history are the transportation between sites or to market. The most volatile one of all is highlighted in a darker pink color and symbolizes the transport of goods between the drug substance site and the drug product site. This could point to an opportunity to cut costs by reducing expedited shipments or otherwise shifting the mode of transit to a more reliable carrier. On the upside, the diagram confirms that our packaging material management has some distinctly non-volatile processes. Understanding why that part of the production is running smoothly might help us achieve similar results in the other parts.

The very same recipe, in the diagram below, looks completely different when reshaped to reveal throughput volatility. We see that throughput is most volatile in the active pharmaceutical ingredient (API) stages of production. While lead time varied the most in shipments between plants, with throughput, it’s drug substance manufacturing itself that’s most variable.

Exhibit 4: Sankey diagram illustrating throughput volatility

Finally, in the version of the recipe shown below, where the links thicken to show excessive levels of material loss, the drug substance steps are again the most problematic. Maybe it’s been a long time since the seals on our API storage containers were replaced, and we’re losing material to evaporation.

Exhibit 5: Sankey diagram illustrating material loss
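Material loss per step can be estimated by comparing what went into a step with what came out, summed over all batches. The sketch below assumes inputs and outputs are recorded in the same unit (kilograms here) and uses invented quantities; real recipes would also need unit conversions and expected-yield adjustments.

```python
from collections import defaultdict

# Hypothetical batch records: (step, input quantity kg, output quantity kg).
batch_quantities = [
    ("drug substance", 100.0, 90.0),
    ("drug substance", 100.0, 86.0),
    ("tablet compression", 50.0, 49.0),
    ("tablet compression", 50.0, 50.0),
    ("packaging", 40.0, 40.0),
]


def material_loss_by_step(records):
    """Return {step: fraction of input material lost across all batches}."""
    total_in = defaultdict(float)
    total_out = defaultdict(float)
    for step, qty_in, qty_out in records:
        total_in[step] += qty_in
        total_out[step] += qty_out
    return {
        step: (total_in[step] - total_out[step]) / total_in[step]
        for step in total_in
    }


loss = material_loss_by_step(batch_quantities)
worst_step = max(loss, key=loss.get)

for step, fraction in loss.items():
    print(f"{step}: {fraction:.1%} lost")
print("Highest loss:", worst_step)
```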

As previously noted, the recipe in these four diagrams is fictitious and has only a few materials so that it’s easy to understand. Real biopharma recipes are far more complicated, making this diagrammatic approach leveraging network optimization all the more valuable.

The diagram at the beginning of this blog is a sanitized real-world recipe. In that example, rather than representing active ingredients, the pink lines show where the standard deviation of lead times is greater than 33 days. What would you do if your packaging data showed that much variation in lead time? If you ask me, I’d start applying predictive and prescriptive data science to isolate and fix the problem!


Written by Jonathan W. Lowe