The Data Science Research-Production Chasm

Joe Plattenburg
Root Data Science
Sep 8, 2021

Tell me if this situation sounds familiar: You’ve shown in a proof-of-concept setting that your new ML model will be effective, but after a full-scale fit, the performance isn’t quite what you expected. Or maybe after applying feature transformations in a production environment, the resulting features are close, but not exactly what they were in your research environment. Or maybe you’ve successfully validated model predictions on historical data, but when that model is pushed to production, there are suddenly a handful of predictions that differ from the research environment. It should be the same model artifact and basically the same code pipeline! What’s going on here?

Welcome to the chasm between data science research and production.

Now, you may be thinking that these problems already have solutions, like:

  • Version-controlled code, e.g., Git
  • ML pipelining framework for defining your process, e.g., Airflow
  • Cloud storage for critical modeling files, e.g., S3
  • Team-wide conventions and data integrity tests

But these tools are only as good as their implementation, and there is still a host of opportunities for breakdowns within components and at their interfaces. Or maybe you’re thinking, “that sales rep from SaaS company XYZ showed me last week how their platform solves all of these problems!” (Spoiler alert: It probably won’t, but they’ll be happy to take your money for the privilege.) Unfortunately, the insidious ways that tooling and process failures magnify small discrepancies between research and production are all too common, and they often turn cracks into chasms.

To some extent, this is the nature of research. We don’t know a priori if something will work or not, and ideas need to be tested as quickly as possible to get them to market. No one would cast blame for taking judicious shortcuts — after all, this is the idea behind the well-known concept of technical debt. The important part, however, is that this debt eventually gets paid down. In software engineering, ongoing technical debt can result in code that is hard to test and maintain. Here we discuss ways in which data science technical debt can manifest as similar but distinct problems, and how to address them early on.

A motivating example

Let’s take an example from the insurance industry: Suppose a data scientist wants to build a new model to predict insurance losses using a new feature from an external source, and then determine where that model shows an improvement over the existing one.

The existing pipeline transforms several different datasets (e.g. braking events, distracted driving events) into a “user-aggregated” dataset (X) via a job (f), which feeds into a model (g) that predicts the probability of an auto accident (y) for that user.

Figure 1: The current state of the world
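
To make the setup concrete, here is a minimal sketch of what such a pipeline might look like in Python. The column names, aggregation logic, and model choice are all hypothetical; the point is just the shape of the computation, y = g(f(X1, …, XN)).

```python
# A minimal, hypothetical sketch of the existing pipeline.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def f(*event_datasets: pd.DataFrame) -> pd.DataFrame:
    """ETL job f: aggregate raw event datasets (X1, ..., XN) into one row per user (X)."""
    # Hypothetical schema: every event dataset has user_id, event_type, and severity.
    events = pd.concat(event_datasets, ignore_index=True)
    return events.pivot_table(
        index="user_id", columns="event_type", values="severity", aggfunc="mean"
    )

def g(model: LogisticRegression, X: pd.DataFrame) -> pd.Series:
    """Model g: predict the probability of an auto accident (y) for each user."""
    return pd.Series(model.predict_proba(X)[:, 1], index=X.index, name="y")

# Production flow (given a fitted model): y = g(model, f(X1, ..., XN))
```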

The new data source (Xj) has information about the vehicle in which the trips were taken, delivered via API requests, and the researcher hopes that this vehicle information may help to improve their hard-braking feature. Therefore, they need that feature available as an input to f. In a perfect world, the new pipeline would look like this:

Figure 2: The ideal future state of the world

However, because this feature comes from a new source, the data engineering team informs the researcher that it will be several weeks before they can modify the ETL pipeline to integrate the new feature via the partner’s API. In addition, the feature can only be integrated once a contract has been negotiated with the external provider; as such, the finance team wants a cost-benefit analysis detailing the model improvement to justify the cost. But of course, the value component of this analysis cannot be generated until the feature can be integrated into the ETL pipeline.

To short-circuit this Catch-22, the enterprising data scientist secures access to a modestly sized “evaluation dataset” (Xj’’) with which they seek to assess the model improvement (e.g. how much better y’ is than y). However, this dataset does not come directly from the production API — maybe it was manually scraped from the underlying dataset. Therefore, the researcher writes custom research code in a notebook that pulls data from the production environment and joins it onto the new evaluation dataset to produce X’’, an estimate of what X’ might eventually be.

Figure 3: The realistic research state of the world
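
A rough sketch of what that ad-hoc step (h’’) might look like is below. The join keys and vehicle columns are hypothetical, and in practice this kind of code tends to live in a one-off notebook rather than in the versioned pipeline, which is exactly the problem.

```python
# A hypothetical sketch of the ad-hoc research step h'': join the production
# features X onto the manually pulled evaluation dataset Xj'' to approximate X'.
import pandas as pd

def h_research(X: pd.DataFrame, Xj_eval: pd.DataFrame) -> pd.DataFrame:
    """Approximate the ideal future X' by joining evaluation vehicle data onto X."""
    vehicle_features = Xj_eval.groupby("user_id").agg(
        vehicle_weight=("vehicle_weight", "mean"),
        abs_equipped=("abs_equipped", "max"),
    )
    # An inner join quietly drops users missing from the manual pull: one of the
    # small, silent discrepancies that can widen the research-production gap later.
    return X.join(vehicle_features, how="inner")

# X'' = h_research(X, Xj'')  ->  used to fit and evaluate the research model g''
```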

With the evaluation dataset in hand, the researcher uses the combined data to fit a new model, g’’, and soon discovers that the results look great! The improved features significantly increase the accuracy of the model. Obviously the business wants to get this new model into production as quickly as possible, so the API integration gets fast-tracked. There is still quite a bit of effort involved in translating the code from the custom analysis script (h’’) into the ETL job (f’), but finally, after several weeks and a couple of delays (the stakeholders have already widely disseminated the new-and-improved results in the most recent investor meeting, but no pressure…), a production-ready model is complete. Just before releasing it into the wild, the data scientist performs one last check between the production predictions (y’) and the research results (y’’).

Uh oh. They don’t match up. Why don’t they match up?! In the best case scenario, the performance metrics are “about the same” and the company needs to get the model out the door, so it is just shipped with some trepidation about the discrepancy. In the worst case, the performance metrics have significantly deteriorated and the researcher descends into a weeks-long purgatory of trying to track down bugs and code differences.

What happened? You’ve probably already seen it, but here’s the problem: It was never clear that y’ was an improvement over y, only that y’’ was. And unfortunately, y’’ is not something that can exist in a production environment because it relies on ad-hoc steps strung together by humans, not automated code. Note the difference between the two definitions:

y’ = g’(X’) = g’(f’(X1, … XN, Xj))

y’’ = g’’(X’’) = g’’(h’’(X, Xj’’)) = g’’(h’’(f(X1, … XN), d’’(Xj)))

The improved result y’’ depends explicitly on h’’ (an ad-hoc notebook run on the analyst’s laptop, perhaps using whatever environment, packages, or dependencies they happen to have installed) and d’’ (a custom data pull, likely performed manually, that may or may not be repeatable).

The example cited above is all too common in industry:

A cycle of quick and exciting research leads to high expectations of great improvement, followed by a long series of delays and disappointments where frustrating integration work fails to recreate those elusive improvements, made all the worse by the feeling of sunk costs and a need to justify the time spent.

In the worst cases, engineers and data scientists become disillusioned by the tedious work, while stakeholders feel they were sold on false claims and become even less likely to take a gamble on a speculative research project in the future. All of this saps velocity and innovation.

What we can do about it

This is a challenging problem. Indeed, many might argue that much of the job of data science is figuring out how to handle uncertain estimates of the world, and those who have a knack for doing this well may be seen as having good “data science sense.” Nevertheless, while there are no silver bullets, here are three general guidelines that can help increase the probability of success.

1. Try to get an early rough read on performance

It might be tempting to say that the answer is to be very meticulous on code quality and analysis from the beginning. But many research efforts won’t pan out, since in most industries, much of the low-hanging fruit has already been picked. Therefore, spending a lot of time perfecting analysis code for every project is likely to lead to a lot of carefully documented null-results.

This may not be the worst thing if you can afford it, but most companies probably want to fail fast and move on to the next idea. You can do this by judiciously cutting some corners and making approximations, for instance:

  • Maybe the production data is hard to get, but there’s a decent proxy available. This will at least provide an order-of-magnitude estimate.
  • An overfit model has the advantage that its in-sample performance is likely an upper bound on real performance. If your model doesn’t look great even on in-sample predictions, that’s a strong sign it isn’t worth a lot more fine-tuning (see the sketch below).
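
A deliberately quick-and-dirty check might look something like this sketch (entirely synthetic data and a hypothetical model choice): fit without any hold-out discipline, score in-sample, and treat the result as an optimistic ceiling.

```python
# A hypothetical "rough read": in-sample performance as an optimistic upper bound.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_proxy = rng.normal(size=(5_000, 10))        # stand-in for hard-to-get production features
y_proxy = rng.binomial(1, 0.1, size=5_000)    # stand-in for accident labels

model = GradientBoostingClassifier().fit(X_proxy, y_proxy)
in_sample_auc = roc_auc_score(y_proxy, model.predict_proba(X_proxy)[:, 1])

# If even this optimistic number is unimpressive, the idea is probably
# not worth a full, careful evaluation.
print(f"Optimistic in-sample AUC: {in_sample_auc:.3f}")
```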

2. Take the time to get the “research code” right

If stakeholders are going to place a lot of weight on the research findings, the researcher obviously needs to take proper care to ensure those findings are valid. However, it’s just as important to communicate to those stakeholders that investments are being made now to ensure the integration is seamless later. This includes things like:

  • Making sure the analysis is done in an environment that is consistent with the production environment (e.g. no mismatched package versions). Tools like common requirements files with pinned dependencies (or even container registries, e.g. AWS ECR) can help mitigate this.
  • Extracting common code out to libraries when reasonable. If there is a particular metric or technique (e.g. a custom loss function or a method for folding datasets for cross-validation), it’s a bad sign if people are “copy-pasta”-ing that code into their notebooks and scripts. A great solution is an internal package repository (these can be made private for proprietary code; Root has built one with JFrog) so everyone can `import util_lib.awesome_func` in their scripts, ensuring standardization. See the sketch below for what such a shared module might look like.
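
For illustration, a module inside such an internal package might look like the following sketch (the module path, function, and loss definition are hypothetical, not an actual Root library).

```python
# util_lib/metrics.py -- a hypothetical shared utility module
import numpy as np

def asymmetric_squared_loss(
    y_true: np.ndarray, y_pred: np.ndarray, under_penalty: float = 2.0
) -> float:
    """Example custom loss that penalizes under-prediction more than over-prediction.

    Keeping functions like this in one importable, versioned place means research
    notebooks and production jobs call the exact same implementation instead of
    copy-pasted variants that silently drift apart.
    """
    residual = np.asarray(y_true) - np.asarray(y_pred)
    weights = np.where(residual > 0, under_penalty, 1.0)
    return float(np.mean(weights * residual**2))
```

Research notebooks and the production scoring job would then both run `from util_lib.metrics import asymmetric_squared_loss`, so changing the metric becomes a versioned library release rather than a manual find-and-replace across notebooks.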

3. Invest in a transparent ETL pipeline with lots of persistence points

While this can take a lot of work up front, the ability to easily obtain production-grade data at any point in the pipeline (without relying on ad-hoc proxies) greatly increases research fidelity. There will still be the need for legitimate approximations and workarounds in research, but the fewer you have, the less your errors will compound. For instance, in the example above, you’re much better off in research if you have access to both X1, …, XN and X, rather than needing to approximate X with a custom implementation of f because X wasn’t persisted.
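
As a sketch of what this could look like (reusing the hypothetical f and g from the earlier sketch, with made-up storage paths), the idea is simply to write each intermediate dataset somewhere durable as the pipeline runs:

```python
# A hypothetical sketch of a pipeline that persists its intermediate outputs.
import pandas as pd

def run_pipeline(raw_datasets: list, model) -> pd.Series:
    X = f(*raw_datasets)                                       # same f the production job runs
    X.to_parquet("s3://my-bucket/features/X.parquet")          # persist the aggregated features
    y = g(model, X)
    y.to_frame().to_parquet("s3://my-bucket/predictions/y.parquet")
    return y

# Later, research can start from the real X instead of re-implementing f by hand:
# X = pd.read_parquet("s3://my-bucket/features/X.parquet")
```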

A final thought

Many of the proposed approaches above involve a fair amount of “tooling.” If you have a dedicated data engineering team, that’s great — they probably are already working on many of these strategies. If not, it can be daunting for data scientists to take these challenges on. Much of data science education centers around algorithms and modeling, whereas these technologies are often relegated to “just productionizing the model” (which often accounts for the majority of the effort). Furthermore, you may not feel that you have the latitude to make these investments — “getting something working” might be more important.

However, these tools will only become more important in the future, and becoming proficient in them now will surely be a boon to your skill set down the road (Coursera, LinkedIn, and others have tons of resources). Also, if you can point to a past effort that had a slow or rocky deployment and clearly articulate how it could be done better in the future, this will help to get buy-in from stakeholders. Remember, you don’t have to solve every problem right away. Just try to keep chipping away and keep in mind:

  • Estimate value with early “rough reads” to narrow to the most promising projects
  • For only those candidates, invest the time to make “research” code high quality
  • If possible, establish good package abstraction and ETL tooling

In time, you just might find that the right investment can help narrow the all-too-real research-production chasm into a manageable crack.
