Managing Your First Data Science or Machine Learning Project

Melissa Thorne
Zulily Tech Blog
Jul 3, 2018

Introduction

When I got started managing software development projects, the standard methodology in use was the Waterfall, in which you attempted to understand and document the entire project lifecycle before anyone wrote a single line of code. While this may have offered some benefit for coordinating large multi-team projects, more often than not it resulted in missed deadlines, inflexibility, and systems that didn't fully achieve their business goals. Agile has since emerged as the dominant project methodology, addressing many of the Waterfall's shortcomings. Agile recognizes that it's unlikely we'll know everything up front, and as such is built on an iterative foundation. This allows us to learn as we go, regularly incorporate stakeholder feedback, and avoid failure.

For Data Science and Machine Learning (DS/ML) projects, I'd argue that an iterative approach is necessary, but not sufficient, for a successful project outcome. DS/ML projects are different. And these differences can fly below the traditional project manager's radar until they pop up late in the schedule and deliver a nasty bite. In this blog post I'll point out some of the key differences I've seen between a traditional software development project and a DS/ML project, and how you can protect your team and your stakeholders from these hidden dangers.

Project Goal

At a high level DS/ML projects typically seek to do one of three things: 1) Explain; 2) Predict; or 3) Find Hidden Structure. In the first two we are predicting something, but with different aims. When we are tasked with explanation we use ‘transparent’ models, meaning they show us directly how they arrive at a prediction. If our model is sufficiently accurate, we can make inferences about what drives the outcome we’re trying to predict. If our goal is simply getting the best possible prediction we can also use ‘black box’ models. These models are generally more complex and don’t provide an easy way to see how they make their predictions, but they can be more accurate than transparent models. When the goal is to find hidden structure, we are interested in creating groups of like entities such as customers, stores, or products, and then working with those groups rather than the individual entities. Regardless of the immediate goal, in all three cases we’re using DS/ML to help allocate scarce organizational resources more effectively.

When we want to explain or predict we need to explicitly define an outcome or behavior of interest. Most retail organizations, for example, are interested in retaining good customers. A common question is “How long will a given customer remain active?” One way to answer this is to build a churn model that attempts to estimate how likely a customer is to stop doing business with us. We now have a defined behavior of interest: customer churn, so we’re ready to start building a model, right? Wrong. A Data Scientist will need a much more precise definition of churn or risk building a model that won’t get used. Which group of customers are we targeting here? Those who just made their first purchase? Those who have been loyal customers for years? Those who haven’t bought anything from us in the last 6 months? High value customers whose activity seems to be tapering off recently? Each of these definitions of ‘at risk’ creates a different population of customers to model, leaving varying amounts of observed behavior at our disposal.

By definition we know less about new customers than about long-time loyal customers, so a model built for the former use case will probably not generalize well to the latter group. Once we've defined the population of customers under consideration, it's important to compute the 'size of the prize'.
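
To make this concrete, here's a minimal sketch in pandas of how different churn definitions carve out different modeling populations. The column names (last_order_date, first_order_date, lifetime_orders, lifetime_revenue) and thresholds are illustrative assumptions, not a real schema:

```python
import pandas as pd

def at_risk_populations(customers: pd.DataFrame, today: pd.Timestamp) -> dict:
    """Each churn definition selects a different slice of customers to model.
    Column names and cutoffs are illustrative, not a real schema."""
    days_since_last = (today - customers["last_order_date"]).dt.days
    tenure_days = (today - customers["first_order_date"]).dt.days

    return {
        # Brand-new customers: a single order, very little observed behavior
        "new_customers": customers[customers["lifetime_orders"] == 1],
        # Long-time loyal customers: years of purchase history to learn from
        "loyal_customers": customers[tenure_days > 2 * 365],
        # Lapsed customers: nothing purchased in the last 6 months
        "lapsed": customers[days_since_last > 180],
        # High-value customers whose activity is tapering off recently
        "high_value_tapering": customers[
            (customers["lifetime_revenue"]
             > customers["lifetime_revenue"].quantile(0.75))
            & (days_since_last > 60)
        ],
    }
```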

Size of the Prize

Say we build a churn model that makes predictions that are 100% accurate (if this were to happen, we’ve likely made a modeling mistake — more on that later), and we apply that model to the customer audience we defined to be at risk of churn. What’s the maximum potential ROI? Since our model can correctly distinguish those customers who will churn from those that won’t, we’d only consider offering retention incentives to those truly at risk of leaving. Maybe we decide to give them a discount on their next purchase. How effective have such interventions been in the past at retaining similar customers?

If a similar discount has historically resulted in 2% of the target audience making an incremental purchase, how many of today's at-risk customers would we retain? If you assume that each retained customer's new order will be comparable to their average historical order amount, and then back out the cost of the discount, how much is left? In some cases, even under ideal conditions you may find that you're targeting a fairly narrow slice of customers to begin with, and the maximum ROI isn't enough to move forward with model-based targeting for a retention campaign. It's much better to know this before you've sunk time and effort into building a model that won't get used. Instead, maybe you could build a model to explain why customers leave in the first place and try to address the root causes.
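
Before any modeling starts, a back-of-the-envelope calculation like the sketch below can tell you whether the prize is worth chasing. Every number here is an illustrative assumption, not a real figure:

```python
# 'Size of the prize' under a hypothetical perfect churn model.
# All numbers are illustrative assumptions.
at_risk_customers = 20_000   # audience a perfect model would flag as churning
redemption_rate = 0.02       # historical response to a similar discount offer
avg_order_value = 60.00      # assume retained customers spend their usual amount
discount_rate = 0.15         # 15% off the next purchase

retained = at_risk_customers * redemption_rate
gross_revenue = retained * avg_order_value
discount_cost = gross_revenue * discount_rate
max_incremental_revenue = gross_revenue - discount_cost

print(f"Retained customers:      {retained:,.0f}")
print(f"Max incremental revenue: ${max_incremental_revenue:,.2f}")
# Even with a 100% accurate model, roughly 400 retained customers and about
# $20K of upside may not justify building and operating the model.
```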

Definition of Success

Often there is a difference between the way you'll assess the accuracy of a model and the way stakeholders will measure the success of the project. DS/ML practitioners use a variety of model performance metrics, many of which are quite technical in nature. These can be very different from the organizational KPIs that a stakeholder will use to judge a project's success. You need to make sure that success from a model accuracy standpoint will also move the KPI needle in the right direction.

Models are built to help us make better decisions in a specific organizational context. If we’re tasked with improving the decision making in an existing process, we need to understand all the things that are taken into account when making that decision today. For example, if we build a model that makes recommendations for products based on previous purchases, but fail to consider current inventory, we may recommend something we can’t deliver. If our business is seasonal in nature, we may be spot on with our product recommendation and have plenty on hand, but suggest something that’s seasonally inappropriate.

Then there is the technical context to consider. If the goal will be making a recommendation in real time in an e-commerce environment, such as when an item is added to a shopping cart, you’ve got to deliver that recommendation quickly and without adding any friction to the checkout process. That means you’ll need to be ready to quickly supply this recommendation for any customer at any time. Models are usually built or ‘trained’ offline, and that process can take some time. A trained model is then fed new data and will output a prediction. This last step is commonly called ‘scoring’. Scoring can be orders of magnitude faster than training. But keep in mind that some algorithms require much more training time than others. Even if your scoring code can keep up with your busiest bursts of customer traffic, if you want to train frequently — perhaps daily so your recommendations take recent customer activity into account — the data acquisition, preparation, and training cycle may not be able to keep up.
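
The asymmetry between training and scoring is easy to see in code. Below is a minimal sketch using scikit-learn on synthetic data (not any particular production setup): the model is fit offline, which can take minutes to hours at real scale, while scoring a single new observation typically takes milliseconds:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Offline: train on historical data. At real data volumes this step, plus the
# data acquisition and preparation that precede it, sets your retraining cadence.
X, y = make_classification(n_samples=20_000, n_features=40, random_state=0)
start = time.time()
model = GradientBoostingClassifier().fit(X, y)
print(f"Training took {time.time() - start:.1f}s")

# Online: score one new customer.
new_customer = np.random.rand(1, 40)
start = time.time()
prob = model.predict_proba(new_customer)[0, 1]
print(f"Scoring took {(time.time() - start) * 1000:.1f}ms (score={prob:.2f})")
```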

Some algorithms need more than data; they also require supplying values for 'tuning parameters'. Think of these as 'knobs' that have to be dialed in to get the best performance. The optimal settings for these knobs will differ from project to project, and can vary over time for the same model. A behavioral shift in your customer base, seasonality, or a change in your product portfolio can all require that a model be retrained and retuned. These are all factors that can affect the quality of your model's recommendations.
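
Here's a hedged sketch of what dialing in those knobs can look like, using scikit-learn's grid search with cross-validation on synthetic data; the parameter grid and scoring metric are illustrative choices, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# The 'knobs': candidate settings to try. The best values can drift as customer
# behavior shifts, so this search is typically rerun when the model is retrained.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X, y)
print("Best knob settings:", search.best_params_)
print("Best cross-validated AUC:", round(search.best_score_, 3))
```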

Once you have clearly defined the desired outcome, the target audience, the definition of success from both the KPI and model accuracy perspectives, and how the model will be deployed you’ve eliminated some of the major reasons models get built but don’t get used.

Model Inputs and Experimentation

In traditional software development, the inputs and outputs are usually very well defined. Whether the output is a report, an on-line shopping experience, an automatic inventory replenishment system, or an automobile’s cruise control, we usually enter into the project with a solid understanding of the inputs and outputs. The desired software just needs to consume the inputs and produce the outputs. That is certainly not to trivialize these types of projects; they can be far more difficult and complex than building a predictive model.

But DS/ML projects often differ in one key respect: while we'll usually know (or will quickly define) what the desired output should be, many times the required inputs are unknown at the beginning of the project. It's also possible that the data needed to make the desired predictions can't be acquired quickly enough or is of such poor quality that the project is infeasible. Unfortunately these outcomes are often not apparent until the Data Scientist has had a chance to explore the data and try some preliminary things out.

Our stakeholders can be of immense help when it comes to identifying candidate model inputs. Data that's used to inform current decision making processes and execute existing business rules can be a rich source of predictive 'signal'. Sometimes our current business processes rely on information that would be difficult to obtain (residing in multiple spreadsheets maintained by different groups) or almost impossible to access (tribal knowledge or intuition). Many algorithms can give us some notion of which data items they find useful for making predictions (strong signal), which play more of a supporting role (weak signal), and which are otherwise uninformative (noise). Inputs containing strong signal are easy to identify, but the distinction between weak signal and noise is not always obvious.
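
One common way to get that notion of strong versus weak signal is to inspect a fitted model's input importances. The sketch below uses permutation importance from scikit-learn on synthetic data; with real inputs the ranking near the bottom is rarely this clean:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: a few informative columns mixed in with noise.
X, y = make_classification(n_samples=5_000, n_features=12, n_informative=3,
                           n_redundant=2, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(12)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance on held-out data: strong signal stands out clearly,
# but separating weak signal from noise near the bottom is much harder.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
ranking = pd.Series(result.importances_mean,
                    index=X.columns).sort_values(ascending=False)
print(ranking)
```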

Ghost Patterns

If we’re lucky we’ll have some good leads on possible model inputs. If we’re not so lucky we’ll have to start from scratch. There are real costs to including uninformative or redundant inputs. Besides the operational costs of acquiring, preparing, and managing inputs, not being selective about what goes into an algorithm can cause some models to learn ‘patterns’ in the training data that turn out to be spurious.

Say I asked you to predict a student’s math grade. Here’s your training data: Amy gets an ‘A’, Bobby gets a ‘B’, and Cindy gets a ‘C’. Now make a prediction: What grade does David get? If that’s all the information you had, you might be inclined to guess ‘D’, despite how shaky that seems. You’d probably be even less inclined to hazard a guess if I asked about Mary’s grade. The more data we put into a model the greater the chance that some data items completely unrelated to the outcome just happen to have a pattern in the training data that looks useful. When you try to make predictions with a data set that your model didn’t see during training, that spurious pattern won’t be there and model performance will suffer.
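
The same effect shows up with more data and a real algorithm. In the sketch below (synthetic data, illustrative sizes), piling a couple hundred pure-noise columns onto a small training set typically drags down accuracy on held-out data, even though nothing informative was removed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# A small training set with 5 genuinely informative columns...
X, y = make_classification(n_samples=300, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
# ...plus 200 columns of pure noise, unrelated to the outcome.
X_noisy = np.hstack([X, rng.normal(size=(300, 200))])

for name, data in [("informative only", X), ("with 200 noise columns", X_noisy)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(f"{name:>23}: train={model.score(X_tr, y_tr):.2f}, "
          f"test={model.score(X_te, y_te):.2f}")
```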

To figure out which candidate model inputs are useful to keep and which should be dropped from further consideration, the DS/ML practitioner must form hypotheses about and conduct experiments on the data. And there’s no guarantee that there will be reliable enough signal in the data to build a model that will meet your stakeholder’s definition of success.

Time Travel

We hope to form long, mutually beneficial relationships with our customers. If we earn our customers' repeat business, we accumulate data and learn things about them over time. It's common to train a model on historical data from one period of time and then test its accuracy on data from a subsequent period. This reasonably simulates what would happen if I built a model on today's data and used it to predict what will happen tomorrow. Looking into the past like this is not without risk, though. When we reach back into a historical data set, we need to be careful to avoid considering data that arrived after the point in the business process at which we want to make a prediction.
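
In practice that usually means a time-based split: train on an earlier period, evaluate on a later one, and make sure every input column would have been knowable at prediction time. A minimal sketch with a made-up schema:

```python
import pandas as pd

# Tiny illustrative dataset; column names are assumptions, not a real schema.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "snapshot_date": pd.to_datetime(["2018-01-15", "2018-02-20",
                                     "2018-04-03", "2018-05-11"]),
    "site_visits_last_30d": [4, 9, 2, 7],
    "made_purchase_next_60d": [0, 1, 0, 1],   # the outcome we want to predict
})

cutoff = pd.Timestamp("2018-04-01")
train = df[df["snapshot_date"] < cutoff]    # learn from the earlier period...
test = df[df["snapshot_date"] >= cutoff]    # ...and evaluate on the later one.

# Crucially, every input must have been knowable as of snapshot_date; anything
# recorded after that point (e.g. first purchase details) leaks the future.
print(len(train), "training rows,", len(test), "test rows")
```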

I once built a model to predict how likely it was that a customer would make their first purchase between two points in time in their tenure. To test how accurate my model was I needed to compare it to real outcomes, so that meant using historical data. When I went to gather the historical data I accidentally included a data item that captured information about the customer's first purchase — something I wouldn't know at the point in time at which I wanted to make the prediction. If a customer had already made their first purchase at the time I wanted to make the prediction, they wouldn't be part of the target population to begin with.

The first indication that I had accidentally let my model cheat by peeking into the future was that it was 100% accurate when tested on a new data set. That generally doesn’t happen, at least to me, and least of all on the first version of a model I build. When I examined the model to see which inputs had a strong signal, the data item with information from the future stood out like a sore thumb. In this case my mistake was obvious, so I simply removed the data item that was ‘leaking’ information from the future and kept going. This particular information leak was easy to detect and correct, but that’s not always the case. And this issue is something that can bite even stakeholders during the conceptualization phases of projects, especially when trying to use DS/ML to improve decision making in longer running business processes.

Business processes that run over days, weeks, or longer typically fill in more and more data as they progress. When we’re reporting or doing other analysis on historical data, it can be easy to lose sight of the fact that not all of that data showed up at the same time. If you want to use a DS/ML capability to improve a long running business process, you need to be mindful of when the data actually becomes available. If not, there’s a real risk of proposing something that sounds awesome but is just not feasible.

Data availability and timing issues can also crop up in business processes that need to take quick action on new information. Even if data appears in an operational system in a timely fashion, it still has to be readied for model scoring and fed to the scoring code. This pre-scoring data preparation process can in some cases be computationally intensive and may have its own input data requirements. Once the data is prepared and delivered to the scoring process, the scoring step itself is typically quick and efficient.
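
A rough sketch of that separation, using pandas and a stand-in model (the schema and aggregations are illustrative): the preparation step is the part that can be heavy and may need to run as a background pipeline, while the scoring call itself is cheap once the features are ready.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Raw events as they might land in an operational system (illustrative schema).
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "event_type": ["view", "cart", "view", "view", "cart", "view"],
})

# Pre-scoring preparation: aggregations like these can be computationally heavy
# at real scale and may have their own upstream inputs, so they often run in a
# background pipeline rather than inside the request path.
features = (events.groupby("customer_id")
                  .agg(events_last_30d=("event_type", "size"),
                       carts_last_30d=("event_type",
                                       lambda s: (s == "cart").sum()))
                  .reset_index())

# Scoring itself is typically quick once the features are ready.
# DummyClassifier is just a stand-in here for a real trained model.
feature_cols = ["events_last_30d", "carts_last_30d"]
model = DummyClassifier(strategy="uniform").fit(features[feature_cols], [0, 1, 0])
scores = model.predict_proba(features[feature_cols])[:, 1]
print(features.assign(score=scores))
```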

Unified Modeling Language (UML) Sequence & Timing Diagrams are useful tools for figuring out how long the end-to-end process might take. It's wise to get a ballpark estimate on this before jumping into model building.

Deploying Models as a Service

Paraphrasing Josh Wills, a Data Scientist is a better programmer than most statisticians, and a better statistician than most programmers. That said, you probably still want software engineers building applications and machine learning engineers building modeling and scoring pipelines. There are two main strategies an application can use to obtain predictions from a model: the model can be directly integrated into the application's code base, or the application can call a service (API) to get model predictions. This choice can have a huge impact on the architecture and success of a DS/ML project.

Integrating a predictive model directly into an application may seem tempting — no need to stand up and maintain a separate service, and the end-to-end implementation of making and acting on a prediction is in the same code base, which can simplify troubleshooting and debugging. But there are downsides. Application and model development are typically done on different cadences and by different teams. An integrated model means a more complicated deployment and testing process, and can put application support engineers in the awkward position of having to troubleshoot code developed by another team with a different set of skills. Integrated models can't easily be exposed to other applications or monitoring processes, and can cause application feature bloat if there's a desire to A/B test competing models.

Using a service to host the scoring code gets around these issues, but also impacts the overall system architecture. Model inputs need to be made available behind the API. At first blush, this may seem like a disadvantage — more work and more moving parts. But the process that collects and prepares data for scoring will often need to operate in the background anyway, independent of normal application flow.

Exposing model predictions as a service has a number of advantages. Most importantly, it allows teams to work more independently and focus on improving the parts of the system that best align with their skill sets. A/B testing of two or more models can be implemented behind the API without touching the application. Having the scoring code run in its own environment also makes it easier to scale. You'll want to log some or all of the predictions, along with their inputs, for offline analysis and long-term prediction performance monitoring. Being able to identify the cases where the model's predictions are the least accurate can be incredibly valuable when looking to improve model performance.
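
As an illustration, here's a minimal sketch of what routing and logging behind the service might look like. The model functions, traffic split, and log format are all hypothetical:

```python
import json
import logging
import random

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scoring")

# Hypothetical stand-ins for two trained models competing in an A/B test.
def model_a(features: dict) -> float:
    return 0.3

def model_b(features: dict) -> float:
    return 0.4

def score_customer(customer_id: str, features: dict,
                   challenger_share: float = 0.1) -> float:
    """Route a small share of traffic to the challenger model and log every
    prediction with its inputs for offline analysis and monitoring."""
    variant = "B" if random.random() < challenger_share else "A"
    prediction = model_b(features) if variant == "B" else model_a(features)
    logger.info(json.dumps({"customer_id": customer_id, "variant": variant,
                            "features": features, "prediction": prediction}))
    return prediction

print(score_customer("cust-123", {"visits_last_30d": 7}))
```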

If a later revision of the model needs to incorporate additional data or prepare existing data in a different way, that work doesn't need to be prioritized with the application development team. Imagine that you've integrated a model directly into an application, but the scoring code needs to be sped up to maintain a good user experience. Your options are much more limited than if you had deployed the scoring code as a service. How about adding in that A/B testing capability? Or logging predictions for offline analysis? Even just deploying a retuned version of an existing model will require cross-team coordination.

Modeling within a Service Based Architecture

The model scoring API is the contract between the DS/ML and application development teams. Changing what's passed into this service or what's returned back to the application (the API's 'signature') is a dead giveaway that the division of responsibilities on either side of the API was not completely thought through. That is a serious risk to project success. For teams to work independently on subsystems, that contract cannot change. A change to an API signature requires both service producers and service consumers to make changes, and will often result in a tighter coupling between the systems — one of the problems we're trying to avoid in the first place with a service-based approach. And always keep the number of things going into and coming out of that API to a bare minimum. The less the API client and server know about each other, the better.

Application development teams may be uneasy about relying on such an opaque service. The more opaque a service is, the less insight the application team has into a core component of their system. It may be tempting to include a lot of diagnostic information in the API's response payload. Don't do it. Instead, focus your efforts on persisting this diagnostic information behind the API. It's fine to return a token or tracing id that can be used to retrieve this information through another channel at a later point in time; just keep your API signature as clean as possible.

As previously discussed, DS/ML projects are inherently iterative in nature and often require substantial experimentation. At the outset we don't know exactly which data items will be useful as model inputs. This presents a problem for a service-based architecture. You want to encourage the Data Scientist to build the best model they can, so they'll need to run a lot of little experiments, each of which could change what the model needs as inputs. So Machine Learning engineers will need to wait until model input requirements settle down to the point where they can start building data acquisition and processing pipelines. But there's a catch: waiting too long before building out the API unnecessarily extends timelines.

So how do we solve this? One idea is to work at the data source rather than the data item level. The Data Scientist should quickly be able to narrow down the possible source systems from which data will need to be acquired, and not long after that know which tables or data structures in those sources are of interest. One useful idiom from the Data Warehousing world is "Touch It, Take It". This means that if today you know you'll need a handful of data items from a given table, it's better to grab everything in that table the first time rather than cherry-picking your first set and then having to open up the integration code each time you identify the need for an additional column. Sure, you'll be acquiring some data prospectively, but you'll also be maintaining a complete set of valuable candidate predictors. You'll thank yourself when building the next version of the model, or a different model in that same domain, because the input data will already be available.

Once the Data Scientist has identified the desired tables or data structures, you’ll have a good idea of the universe of data that could potentially be made available to the scoring code behind the API. This is the time to nail down the leanest API signature you can. A customer id goes in and a yes / no decision and a tracing token comes out. That’s it. Once you’ve got a minimal signature defined freeze it — at least until the first version of the model is in production.
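
Something like the following sketch, here using Flask purely for illustration; the endpoint name and internals are hypothetical, but the shape of the signature is the point:

```python
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)

def score_from_prepared_features(customer_id: str) -> bool:
    # Placeholder: a real service would look up prepared features for this
    # customer and call the trained model's scoring code.
    return False

@app.route("/churn-score", methods=["POST"])
def churn_score():
    """Lean signature: a customer id goes in; a yes/no decision and a tracing
    token come out. Diagnostics stay behind the API, keyed by the trace id."""
    customer_id = request.get_json()["customer_id"]
    trace_id = str(uuid.uuid4())
    at_risk = score_from_prepared_features(customer_id)
    return jsonify({"at_risk": at_risk, "trace_id": trace_id})

if __name__ == "__main__":
    app.run()
```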

Business Rules First

Predictive models are often used to improve an existing business process built on business rules. In that case, consider the business rules version zero of the model: they establish the baseline level of performance against which the model will be judged. Consider powering the first version of the API with these business rules. The earlier the end-to-end path can be exercised the better, and having a business-rule-based version of the model available as a fallback provides a quick way to roll back the first model without rolling back the architecture.
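
A sketch of what that can look like in code (rule thresholds and function names are illustrative): the business rules implement the same interface the model will eventually use, so swapping the model in, or rolling it back, never touches the application.

```python
# Version zero of the 'model': the existing business rules, exposed behind the
# same interface the real model will use. Thresholds are illustrative.
def rules_based_score(features: dict) -> bool:
    return (features.get("days_since_last_order", 0) > 90
            and features.get("lifetime_orders", 0) >= 2)

def model_based_score(features: dict) -> bool:
    raise NotImplementedError("swapped in once the first model ships")

# Behind the API, a single switch controls which version serves traffic,
# which also doubles as a quick rollback path.
ACTIVE_SCORER = rules_based_score

def score(features: dict) -> bool:
    return ACTIVE_SCORER(features)

print(score({"days_since_last_order": 120, "lifetime_orders": 5}))
```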

In Closing

In this post I’ve tried to highlight some of the more important differences I’ve experienced between a traditional software development project and a DS/ML project. I was fortunate enough to be on the lookout for a few of these, but most only came to light in hindsight. DS/ML projects have enough inherent uncertainty; hopefully you’ll be able to use some of the information in this post to avoid some of these pitfalls in your next DS/ML project.

Originally published at https://zulily-tech.com on July 3, 2018.
