Deep Learning: In Production

Tom Farrand · Published in Systems AI · Jul 16, 2019 · 8 min read

Deep learning (DL) is incredibly challenging, and it gets worse when deploying to production.

Trying to figure out what to do with your latest algorithmic creation.

The topic of deploying machine learning (ML) systems, of which DL is a subset, into the wild has been explored in the literature⁽¹⁾⁽²⁾⁽³⁾ by researchers such as D. Sculley. However, this has typically been through the lens of more “traditional” ML systems, i.e. those working with structured datasets. In this article, I will consider how the phenomena known to impact traditional ML manifest themselves in DL systems.

To do so, I will look at three factors which introduce system-level complexity. This type of complexity shapes how different systems, and therefore people and ultimately data, interact with one another, meaning that it can cause some serious headaches.

System-level complexity is above and beyond the type of technical debt expected within software engineering, which can be remedied with refactoring, better tests, and tightening of APIs, amongst other things. These methods of reducing technical debt are still perfectly valid for ML and DL systems, but further mitigation tactics should be investigated; some are suggested in each of the “How do I mitigate it?” sections.

So, without further ado…

1. Entanglement

What is it?

When building a higher-level model atop a foundation of data, you are effectively entangling your data and model very closely with one another. Otherwise independent data sources are now tied together when generating outputs from your system.

In practice this means that if and when the underlying data changes, e.g. the distribution of features alters, the model’s outputs will be impacted. In the literature this phenomenon is called CACE: Changing Anything Changes Everything. It is more colloquially known as the butterfly effect.
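To make CACE concrete, here is a minimal sketch using synthetic data and scikit-learn (both my own choices, not anything from the systems discussed in this article). A model is trained on two features; an upstream change then shifts the distribution of only the first feature, yet every prediction the system makes moves with it.

```python
# A minimal, self-contained sketch of CACE on synthetic data: shifting the
# distribution of a single input feature moves the whole output distribution
# of a trained model, even though the other feature never changed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training data: two features, a simple linear decision rule plus noise.
X_train = rng.normal(loc=0.0, scale=1.0, size=(5000, 2))
y_train = (X_train[:, 0] + X_train[:, 1] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# "Production" data drawn from the same distribution as training...
X_prod = rng.normal(loc=0.0, scale=1.0, size=(5000, 2))
# ...and the same data after an upstream change shifts only feature 0.
X_shifted = X_prod.copy()
X_shifted[:, 0] += 0.8  # upstream pipeline change: feature 0 drifts

before = model.predict_proba(X_prod)[:, 1].mean()
after = model.predict_proba(X_shifted)[:, 1].mean()
print(f"mean P(y=1) before shift: {before:.3f}, after shift: {after:.3f}")
# Feature 1 never changed, yet the system's output distribution has moved.
```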

How does it impact DL systems?

Interestingly, entanglement presents less of a problem to DL systems than to their traditional ML cousins. DL systems frequently rely on a single source of highly complex data, e.g. images from an MRI machine, meaning that independent data sources do not have to be blended together.

However, due to the highly complex nature of the data sources, picking apart relevant monitoring metrics to track the data is incredibly difficult, and effectively visualising the data is tricky. Therefore, it is wise to collect relevant metadata which can provide important contextual information for your datasets. Figure 1.1 shows what this could look like in practice for a DL system used to identify cell types in tissue samples; a sketch of such a metadata check follows the figure. In the example, the training data does not contain samples from the Tokyo lab. Therefore, it is highly likely that the model trained solely on London, Paris, and Berlin’s samples will misbehave on this new data. Without collecting and subsequently monitoring metadata, this type of change is very difficult to quantify.

Figure 1.1: An example of why metadata needs to be collected.
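As a concrete illustration of the Figure 1.1 scenario, the sketch below tracks a simple metadata KPI, the lab of origin, and flags incoming batches containing values the model never saw during training. The field names and pandas-based layout are hypothetical choices for illustration.

```python
# A hypothetical sketch of metadata monitoring for the Figure 1.1 scenario.
# Field names ("lab", "scanner_model") are invented for illustration; the
# point is to track simple contextual metadata alongside the primary data.
import pandas as pd

# Metadata recorded when the model was trained.
training_meta = pd.DataFrame({
    "lab": ["London", "Paris", "Berlin", "London", "Paris"],
    "scanner_model": ["X100", "X100", "X200", "X100", "X200"],
})

# Metadata arriving with a new batch of production samples.
production_meta = pd.DataFrame({
    "lab": ["London", "Tokyo", "Tokyo", "Berlin"],
    "scanner_model": ["X100", "X300", "X300", "X200"],
})

def unseen_values(train_df: pd.DataFrame, prod_df: pd.DataFrame, column: str) -> set:
    """Return metadata values present in production but absent from training."""
    return set(prod_df[column]) - set(train_df[column])

for column in ["lab", "scanner_model"]:
    unseen = unseen_values(training_meta, production_meta, column)
    if unseen:
        # In practice this would raise an alert rather than print.
        print(f"WARNING: '{column}' values never seen in training: {sorted(unseen)}")
```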

How do I mitigate it?

  1. Holistic monitoring of the data sources. Generate a set of data KPIs which represent “healthy” norms of your data. Create visualisations of the features within your dataset to gain an “at a glance” view of the data. Collect metadata from your sources; this is particularly important for DL systems.
  2. Try to decompose your problem into modular sub-problems. This will prevent having to aggregate a wide range of data sources. For example, a stock prediction problem could be safely chunked into an ensemble of separate models; see Figure 1.2 below and the sketch that follows it. This helps to keep separate data sources independent of one another.
Figure 1.2: Decomposing stock price prediction into three separate sub-problems, each of which has a causal influence on the stock price.
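A hypothetical sketch of the decomposition in Figure 1.2: three sub-models, each consuming only its own data source, combined by a simple aggregator. The sub-problem names, placeholder logic, and averaging rule are all illustrative assumptions rather than a recipe for actually modelling stock prices.

```python
# A hypothetical sketch of the decomposition in Figure 1.2. Each sub-model
# sees only its own data source, so the sources stay independent; only the
# final aggregation step brings their outputs together.
from dataclasses import dataclass

@dataclass
class SubModelOutput:
    name: str
    score: float  # each sub-model emits a bounded signal, e.g. in [-1, 1]

def sentiment_model(news_headlines: list[str]) -> SubModelOutput:
    # Placeholder logic standing in for a trained NLP model.
    score = 0.1 * sum("beats expectations" in h.lower() for h in news_headlines)
    return SubModelOutput("news_sentiment", min(score, 1.0))

def fundamentals_model(earnings_per_share: float, book_value: float) -> SubModelOutput:
    # Placeholder logic standing in for a model over company fundamentals.
    return SubModelOutput("fundamentals", 1.0 if earnings_per_share / book_value > 0.1 else -0.5)

def technicals_model(closing_prices: list[float]) -> SubModelOutput:
    # Placeholder logic standing in for a model over recent price history.
    return SubModelOutput("technicals", 1.0 if closing_prices[-1] > closing_prices[0] else -1.0)

def aggregate(outputs: list[SubModelOutput]) -> float:
    """Combine independent sub-model signals; here a simple average."""
    return sum(o.score for o in outputs) / len(outputs)

signal = aggregate([
    sentiment_model(["ACME beats expectations in Q3"]),
    fundamentals_model(earnings_per_share=2.4, book_value=20.0),
    technicals_model([101.0, 102.5, 104.0]),
])
print(f"combined signal: {signal:.2f}")
```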

2. Unstable Data

What is it?

Aggregating together disparate data sources to produce meaningful insights is a key premise of ML. However, instability in the data sources which feed an ML system can cause unexpected and unpredictable changes in the system’s performance over time.

Instability within data sources can be divided into two categories:

  • Implicit: Implicit instability is inherent to the data source itself. For example, a data source you are consuming may be the output from another ML system which itself is likely to drift and change over time.
  • Explicit: Explicit instability is caused by the influence of a third party on the data. For example, a specific data stream may be owned by another team, who are likely to make conscious efforts to update that data source.

How does it impact DL systems?

Much of the data which feeds DL algorithms is noisy and highly multi-dimensional. Therefore, unstable data is an inherent part of training and operationalising DL systems, and practitioners of DL will be well versed in handling noisy data sources when training algorithms.

However, issues can arise during deployment if there are differences between production and training data, or if there is model/data drift over time. This phenomenon is shown in Figure 1.1, and explained in the surrounding text.
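One lightweight way to catch this kind of drift is a statistical comparison between a training-time reference sample and a rolling window of production data. The sketch below applies SciPy's two-sample Kolmogorov-Smirnov test to a single monitored quantity, which for a DL system would typically be a piece of metadata or a summary statistic of the raw inputs; the p-value threshold is an illustrative assumption.

```python
# A minimal drift-check sketch: compare a reference (training-time) sample
# of a monitored quantity against a window of production values using a
# two-sample Kolmogorov-Smirnov test. The 0.01 p-value threshold is an
# illustrative choice, not a universal recommendation.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Summary statistic recorded at training time, e.g. mean pixel intensity
# per image, or a piece of numeric metadata.
reference = rng.normal(loc=0.45, scale=0.05, size=2000)

# The same statistic computed over the most recent production window,
# drawn here from a slightly shifted distribution to simulate drift.
production_window = rng.normal(loc=0.50, scale=0.05, size=500)

statistic, p_value = ks_2samp(reference, production_window)
if p_value < 0.01:
    print(f"Possible drift detected (KS statistic={statistic:.3f}, p={p_value:.1e})")
else:
    print("No significant drift detected in this window")
```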

How do I mitigate it?

  1. Versioning of data sources. For data sources which can be controlled, a versioning system is an easy way of ensuring that explicit changes to the source do not slip into your system unnoticed. Remember that pinning to a version introduces staleness of its own. A lightweight sketch of this approach follows the list.
  2. Echoing the mitigation strategy for entanglement: holistic data monitoring. Keeping a close eye on the data flows within the system will allow drift or fluctuations to be caught early and systematically. It is worth repeating that for DL systems this is likely to mean metadata associated with the primary data, e.g. camera information associated with imagery. Metadata can be tricky to reconcile when training data is obtained, or scraped, from external sources. In these cases it is imperative that thorough testing is conducted on initial models with data identical to that of production sources. This will not be foolproof, but it encourages an initial repository of production-standard data and associated metadata to be built for testing purposes.
  3. Contextual understanding of the problem you are trying to solve. DL systems cannot encode all of the contextual knowledge that you or your team will have about the challenge you are attempting to tackle. Therefore, having a solid, foundational understanding of your problem is crucial. It is recommended that this is documented as explicitly and thoroughly as possible, in an accessible place, to allow team members to amend and update it as their understanding changes. This will provide a basis for capturing information not contained within the data being fed into your system. If you believe that the problem is fully captured within the data you have been provided, you are missing something.
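As a deliberately simple illustration of point 1, the sketch below fingerprints a controlled data source by hashing its contents and records the fingerprint alongside the model; any mismatch at deployment time means the source has changed since training. The file paths are hypothetical, and dedicated tools such as DVC handle this far more robustly.

```python
# A deliberately simple data-versioning sketch: fingerprint a data source
# by hashing its contents and record the hash alongside the trained model.
# The file paths are hypothetical; dedicated tooling does this more robustly,
# but the principle is the same.
import hashlib
import json
from pathlib import Path

def fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 hex digest of the file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_version(data_path: Path, manifest_path: Path) -> None:
    """At training time, write the data fingerprint next to the model artefacts."""
    manifest = {"data_file": str(data_path), "sha256": fingerprint(data_path)}
    manifest_path.write_text(json.dumps(manifest, indent=2))

def check_version(data_path: Path, manifest_path: Path) -> bool:
    """At deployment time, confirm the source still matches the training snapshot."""
    manifest = json.loads(manifest_path.read_text())
    return fingerprint(data_path) == manifest["sha256"]

# Hypothetical usage:
# record_version(Path("data/train_images.tar"), Path("model/data_manifest.json"))
# assert check_version(Path("data/train_images.tar"), Path("model/data_manifest.json"))
```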

3. Hidden Feedback Loops

What is it?

Creating and deploying prediction models in the wild may produce unexpected feedback loops as they interact with the real world. This occurs when a data input to your system is impacted by the output of your ML model.

This is best illustrated with an example. Consider the case of a magical treatment system which, given a range of data, spits out the treatment approach to remedy the patient's condition. A feedback loop presents itself in this scenario when the prescribed treatment alters the patient data, so that later diagnoses made by the system rely on this altered data.

In the example of an end-to-end diagnosis workflow, the generation of feedback loops is obvious. However, taking a more real-world example, where a DL system is only present in a certain part of the diagnostic workflow, it should be clear that spotting these loops in the wild can be tricky. Figure 1.3 shows a typical diagnostic workflow for breast cancer.

Figure 1.3: Simplified diagnostic workflow for breast cancer. Highlighted in purple is the DL component of the overall system.

As indicated within Figure 1.3, the DL component makes up a fraction of the total workflow. However, the whole system is still representative of the end-to-end solution described previously, except that doctors do much of the heavy lifting, as opposed to a set of algorithms. Therefore, a feedback loop still develops over time, as patients are sent for multiple mammograms over the course of their treatment to monitor their health. The feedback loop’s direct effect on our DL system is now abstracted behind a series of proxies: the doctors and other medical staff in the workflow. It is hard to say what the impact of this feedback loop on the DL system would be, but it is plausible to imagine that unless the system has been trained (or re-trained) on a historical series of mammograms from the same patient it could have an adverse impact on later predictions, as treated patients are a group that has been invisible to the model until now.

How does it impact DL systems?

Where they occur, hidden feedback loops will cause instability regardless of whether it is an ML or DL system they are impacting. Unfortunately for all of the DL enthusiasts out there, hidden feedback loops are exacerbated within DL systems. A fundamental factor in this is the black-box nature of the algorithms being used. Without the ability to interrogate model outputs, it becomes incredibly difficult to determine how they are changing over time, and therefore whether a feedback loop is manifesting.

How do I mitigate it?

  1. Primarily, work towards generating explainable and auditable outputs from your system. Algorithms such as LIME (Local Interpretable Model-Agnostic Explanations) can be applied to many modern neural network architectures to add some level of “reasoning” to DL models. More simply, adding confidence readings to model outputs can help remedy some issues.
  2. Hidden feedback loops can take a long time to manifest themselves, particularly if a model is being re-trained and slowly but consistently adjusts its ‘opinion’ of a feature over time. Tracking model KPIs is therefore important; these would be a different set of KPIs from those considered earlier in the entanglement mitigation strategies. The pattern to be vigilant for is a consistent upward or downward trend over time without model re-training, which indicates a change in predictive behaviour without meddling from you or your team. A sketch of such tracking follows this list.
  3. Be mindful of changes to the external environment in which your model is deployed. Listen to the subject matter experts within your team, as they will be able to provide you with an invaluable understanding of the world the model is acting within.
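Tying points 1 and 2 together, the sketch below logs one simple model KPI, the mean top-class confidence per time window, and flags a sustained monotonic trend in the absence of re-training. The window contents are simulated, and the "three consecutive moves in the same direction" rule is an illustrative assumption.

```python
# A minimal sketch of model-KPI tracking for feedback-loop detection: log the
# mean top-class confidence per time window and flag a sustained monotonic
# trend when no re-training has occurred. Window size and the "three
# consecutive moves in the same direction" rule are illustrative choices.
import numpy as np

def mean_top_confidence(probabilities: np.ndarray) -> float:
    """Mean of the highest softmax probability per prediction in a window."""
    return float(probabilities.max(axis=1).mean())

def sustained_trend(kpi_history: list[float], steps: int = 3) -> bool:
    """True if the KPI moved in the same direction for `steps` consecutive windows."""
    if len(kpi_history) < steps + 1:
        return False
    diffs = np.diff(kpi_history[-(steps + 1):])
    return bool(np.all(diffs > 0) or np.all(diffs < 0))

# Simulated weekly windows of softmax outputs from a 3-class model whose
# confidence creeps upward over time without any re-training.
rng = np.random.default_rng(7)
kpi_history: list[float] = []
for week, base in enumerate([0.70, 0.72, 0.75, 0.79, 0.84]):
    # Construct a window of outputs whose top-class probability hovers
    # around `base`, mimicking growing (over)confidence.
    top = np.clip(rng.normal(loc=base, scale=0.02, size=1000), 0.34, 0.99)
    probs = np.column_stack([top, (1 - top) / 2, (1 - top) / 2])
    kpi_history.append(mean_top_confidence(probs))
    if sustained_trend(kpi_history):
        print(f"Week {week}: sustained confidence trend without re-training, investigate")
```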

Conclusion

I hope this has given you some greater insight into the challenges faced when deploying both ML and DL systems to a production environment.

This has only scratched the surface of the issue, and so if you would like to see more then please drop a comment below!

References

⁽¹⁾ D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young. The High-Interest Credit Card of Technical Debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop). 2014.

⁽²⁾ Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. What’s your ML Test Score? A rubric for ML production systems. In Reliable Machine Learning in the Wild — NIPS 2016 Workshop. 2016.

⁽³⁾ D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison. Hidden Technical Debt in Machine Learning Systems. NIPS 2015. 2015.
