Data Science Projects: How You Can Fail Before A Single Line of Code Is Written — and How to Avoid It

Michael Chu
GAMMA — Part of BCG X
10 min read · Sep 29, 2019


Let’s say your company has decided to address a business problem using data science and machine learning. You have a strong team armed with the latest algorithms, and it’s time to let them loose and “let the data speak for itself.” What could go wrong? As I have seen far too many times, it’s not even close to being that easy. Lack of planning can send even the smartest data scientists in the wrong direction, dooming the project to failure before a single line of code is written.

To avoid that fate, there are six potential points of failure to consider when embarking on a data science / machine learning project.

Each of these points represents a key set of considerations that, if handled incorrectly, can result in a failed project. Let’s go through some examples for each in order to better understand what it really means to plan well:

Point 1: Business Value

It stands to reason that before beginning a data science project, you would have a clear understanding of how the project will add value to your company. But in the time-honored tradition of “Ready, Fire, Aim,” many companies put the wheels in motion before identifying the destination. To avoid this, determine which KPIs you expect the project to change and the business value of those changes — before you start the project. If your goal is to optimize promotions, ask how much is spent on promotions annually and how much incremental sales or margin those promotions generate. If the answer to either is a small number, then there is a significant chance that the cost of the project will outweigh any benefits that might come from it. If that is the likely result, why do the project in the first place?

Look closely at the specific metric you want to improve, and size the impact. Don’t just take an overall number and say “get 5% benefit.” Be as precise as possible. Instead of making a broad statement that you want to improve retail promotions, drill down. Do you want to focus on promotions that aren’t subject to contractual provisions with your vendors?
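To make this concrete, here is a minimal back-of-envelope sketch in Python. Every figure and name in it is a hypothetical assumption for illustration, not a number from a real project:

```python
# Back-of-envelope sizing of a promotion-optimization project.
# Every figure below is a hypothetical assumption for illustration.

addressable_promo_spend = 12_000_000  # annual promotion spend NOT locked into vendor contracts
expected_improvement = 0.04           # assumed 4% efficiency gain from better targeting
project_cost = 900_000                # assumed first-year cost to build and run the model

expected_annual_benefit = addressable_promo_spend * expected_improvement

print(f"Expected annual benefit: ${expected_annual_benefit:,.0f}")
print(f"First-year net value:    ${expected_annual_benefit - project_cost:,.0f}")

# If the addressable spend or the plausible improvement is small, the cost
# of the project can easily outweigh any benefit; that is the signal to
# rethink the project before any modeling begins.
```

Ten minutes with a calculation like this, done on the slice of the business the model can actually touch, is often enough to tell whether a project is worth starting.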

Or you might want to focus on decisions that typically don’t have much human scrutiny and analysis. Yes, you could use analytics to predict risk of churn for the largest accounts, but if your company’s salespeople talk to these accounts all the time, what is the chance that your model is better than what they know? Instead, you might estimate the value of the project based on the information you will gain about your less-visited, smaller business clients.

Point 2: Business Process Impact

You must also consider what specific decision the results of the data analytics project will change, and whether it will change the information that agents rely on to guide their actions. Let’s say a company wants to address this issue of churn among both large and small clients. The data science team decides to build a model to predict which customers are going to churn in the next 3 months. The results of this model, in turn, will be forwarded to the retention specialists who take customer cancellation calls and attempt to save those customers.

This approach makes perfect sense as a data model, but as a business tool it suffers from a basic flaw: By the time these customers have worked their way through to the cancellation department, they have already made it very clear that they want to cancel. As such, the results of the model are not going to help the retention team dissuade customers from leaving. It would be more useful to the team to develop a model that explains why these customers want to cancel. Is the service the company provides too expensive? Is it not what the customers expected or needed? Did the customer get a better offer? To have any practical value, the analytics must help the decision maker in their specific context and domain of control.

This issue of project design failing to align with business goals arose during another project involving consumer promotions. In the early stage of the project, we came across records of a previous company effort to develop a model to predict the impact of these promotions. The model itself was robust and accurate, and it had a great user interface. You could plan promotions online, automatically pull the necessary data and then, with the press of a single button, publish the results for approval.

However, the marketing team actually never used the model, despite its predictive power. Why? Because the promotion tool suffered from three serious process flaws:

1) It did not provide sufficient background evidence or rationale for its recommendations.

2) It did not allow for sufficient user overrides. If, for instance, a business user deliberately planned a loss leader to drive traffic, the tool would predict poor performance, with no way to flag the promotion as intentional.

3) Worst of all, in order to pull the required data from legacy IT systems and generate an estimate, the tool required that the planned promotion be published, thereby making the plan visible to the planner’s manager. Even if the planner was just experimenting, the manager would still see the work. In cases like this, the tool had the unintended consequence of punishing positive behavior (exploring potential options to find the best promotion) with unwanted and unnecessary scrutiny.

The underlying point here is that, as tempting as it is to just jump in and start building algorithms, we advise data science teams to observe how decisions are made in “their natural habitat.” That is to say, talk to the people who are going to be the users of the tools you build. Talk to anyone else who will be affected by the use of the tool. Understand their motivations and their frustrations. Understand that what looks like a sub-optimal decision from the math standpoint could be purely rational from the users’ standpoint, so you may need to adjust your assumptions and performance metrics accordingly.

Point 3: Data Availability, Quality, and Governance

All models rely on data, but data does not exist in a vacuum. Talented professionals are in charge of managing pipelines, data warehouses, and data lakes. Even so, data can lose its value. Specific data fields might go out of use or no longer be updated. Certain data-hygiene processes might no longer be followed. Whatever the cause of the degradation, models based on that data are destined to fail. Before starting any data-based project, make sure you are fully aware of that data’s quality, completeness, and update frequency, as well as of any anticipated changes to the database.

We once worked on a project in which a certain customer data point was very predictive and led to outstanding model performance. Then we realized that the salespeople assigned the final value to that field only after the customer signed the contract. Prior to that step, the salespeople would assign a random value (often the default) to the field. In other words, the field leaked the outcome into the training data; at prediction time, it would have contained nothing but noise. Had that model and its faulty data gone live, there is no telling how many customers might have been affected.
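One practical defense is to probe suspiciously predictive fields before trusting them. Below is a minimal sketch, assuming a pandas DataFrame with invented column names and values, of two checks: whether a field separates the outcome classes implausibly well, and whether it was written after the outcome it is supposed to predict:

```python
import pandas as pd

# Hypothetical customer snapshot; all column names and values are invented.
df = pd.DataFrame({
    "customer_id":        [1, 2, 3, 4, 5, 6],
    "deal_stage_score":   [0.9, 0.1, 0.8, 0.2, 0.9, 0.1],  # the suspiciously predictive field
    "signed_contract":    [True, False, True, False, True, False],
    "field_updated_at":   pd.to_datetime(["2019-06-02", "2019-01-15", "2019-06-20",
                                          "2019-02-01", "2019-07-11", "2019-03-05"]),
    "contract_signed_at": pd.to_datetime(["2019-06-01", None, "2019-06-18",
                                          None, "2019-07-10", None]),
})

# Check 1: does the field separate the outcome classes implausibly well?
print(df.groupby("signed_contract")["deal_stage_score"].describe())

# Check 2: was the field written AFTER the outcome it is supposed to predict?
signed = df[df["signed_contract"]]
leaked = (signed["field_updated_at"] > signed["contract_signed_at"]).mean()
print(f"Signed customers whose field was updated after signing: {leaked:.0%}")
```

If check 2 comes back high, the field is recording the outcome rather than predicting it, and it has to be excluded or reconstructed as of the prediction date.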

There are two ways to guard against this:

1. Get help: There is often a data professional, typically in business intelligence or finance, who understands the “watch-outs” in the data. You need this person on your side, working with you throughout the process. Most outliers and anomalies are not mistakes: They are the outcomes of a very specific process that generated the data.

2. Slow down: Take at least 10 examples, such as 10 customers, and walk through all their data. Build a picture in your mind of what each customer looks like, and then check whether that picture makes sense. You’d be surprised how often you find things that change your fundamental understanding of the data. For example, we were doing a data science project with an airline when we came across bookings for what seemed like abnormally large families of 20 or more. Upon further research, we discovered that group bookings had found their way into our dataset. A minimal sketch of this kind of walkthrough follows below.
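Assuming a pandas table of bookings, with all names and values invented for illustration, it might look like this:

```python
import pandas as pd

# Hypothetical airline bookings table; all names and values are invented.
bookings = pd.DataFrame({
    "booking_id": [101, 101, 102, 103, 103, 103] + [104] * 22,
    "passenger":  [f"pax_{i}" for i in range(28)],
    "fare":       [120.0] * 28,
})

# Step 1: walk through individual bookings end to end and ask whether each
# one describes a plausible customer.
for booking_id, group in bookings.groupby("booking_id"):
    print(f"booking {booking_id}: {len(group)} passenger(s)")

# Step 2: a simple distribution check surfaces the "families" of 20 or more,
# which in the airline project described above turned out to be group bookings.
party_size = bookings.groupby("booking_id").size()
print(party_size.describe())
print(party_size[party_size >= 20])  # suspiciously large parties
```

The point of the exercise is not the code; it is forcing yourself to look at whole records rather than summary statistics.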

Point 4: Analytical Approach

Before even considering which algorithm to use, you first must determine the objective function of the project. Is the goal to maximize sales, or sales per opportunity? Are you interested in margin, or in how much lift is needed to drive business value? Once you’ve figured that out, then — and only then — will it be time to discuss algorithms and data models.

For one promotional-effectiveness project we worked on, the initial brief involved building a model to predict sales volume per item. One objective function could have been some measure of prediction accuracy by item and week. But what the business really needed was the ability to determine which promotions to run for each specific item. Typically, these promotions would run for a number of weeks, so a better objective function would have been to recommend, for each item, the promotion and price that would generate the highest incremental profit over the entire promotion period (after taking product economics into account).

Using this metric would have meant that prediction accuracy itself was far less important. If the model said to run a promotion, and that promotion generated five times the profit relative to doing something else, then accuracy at the item and week level would not matter nearly as much as volume lift over the entire promotion. The clearer you are on the function of the model and where it does and does not need to be extremely accurate, the less time your data scientists will spend creating unnecessary capabilities.
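As a hedged illustration with invented numbers, the sketch below contrasts the two objectives. A model can be mediocre at week-level volume prediction and still pick the more profitable promotion, which is what the business objective rewards:

```python
# Hypothetical numbers contrasting two objective functions.

baseline_weekly_profit = 10_000  # assumed weekly profit with no promotion

candidate_promotions = {
    # promotion: (predicted weekly profit, duration in weeks)
    "10%_off":     (13_000, 4),
    "two_for_one": (11_500, 4),
}

def incremental_profit(weekly_profit: float, weeks: int) -> float:
    """Incremental profit over the full promotion period versus doing nothing."""
    return (weekly_profit - baseline_weekly_profit) * weeks

best = max(candidate_promotions,
           key=lambda name: incremental_profit(*candidate_promotions[name]))
print(f"Recommended promotion: {best}")
# The objective is "pick the most profitable promotion for the period",
# not "minimize week-level prediction error".
```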

Point 5: Match Between Team Skills and Objective

Clearly, if you don’t have the right talent or technical environment for a project, that project is at risk. It is not unusual — and is often expected — for data scientists to do some amount of learning on the job. But sometimes that can go too far. Simply asking a machine-learning engineer to design an experiment or an optimization guru to solve an econometric problem, without knowing anything about their skill level or specialty, is yet another way to put an entire project at risk before it begins.

For optimal results, begin project planning by having conversations with members of the data science team. Find out what they do best, what experiences they’ve had, what they are interested in learning about. You need to know beforehand if they will be able to hit the ground running, or if they will first need months (or more) of training.

We have found in our work that there are a number of data scientist “archetypes.” Some data scientists are predictive modelers. Others are statisticians. Some excel at operations research or are optimization gurus, visualization experts, or natural language processing experts. While many data scientists know a bit of everything (something we highly encourage), most will have their “major.” If possible, avoid assembling a team whose members have only “minored” in the problem area you are addressing.

Point 6: Plan for Piloting and Testing

It stands to reason that, before you start building a model, you should have some idea of how to test it. Should testing include cross-validation? Should you run a virtual pilot in which business users get to see and then vet decisions generated by the models? If this is a live pilot, should it be small scale or large scale?
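If cross-validation is part of the answer, it should mirror how the model will actually be deployed. The sketch below, assuming scikit-learn and entirely synthetic data, uses a time-based split so that each fold trains on the past and is evaluated on the future, the way a live pilot would experience the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Entirely synthetic, time-ordered data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = (X[:, 0] + rng.normal(size=1_000) > 0).astype(int)

# Each fold trains on the past and evaluates on the future, mimicking how
# the model would actually be used in a live pilot.
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])
    print(f"fold {fold}: {len(train_idx)} past rows -> AUC {auc:.2f} on future rows")
```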

How to test a model can be a function of the “speed of feedback.” In other words, you need to know how long it will be before you know whether your models are working, and how quickly feedback can be incorporated into the training data. Speed of feedback will tell you how good a model needs to be before testing.

In many situations, it is best to build an MVP, release it, and then continuously improve it. In other situations, it makes more sense to build something more rigorous before releasing it. In developing a sales support tool for a client once, we erred on the side of rigor. We knew that getting the model right in almost every situation was critical for what had become a very change-weary sales team. If we had provided something that was “mostly right” but suffered from performance issues on even a small percentage of accounts, we knew we would lose the team’s trust and confidence. We also knew that if we lost their trust it would take a great deal of effort to convince them to give us another chance. Imagine you’re using Excel, and every now and then it adds 1 + 1 and gets 3. How likely are you to trust that spreadsheet again? But if a retailer’s marketing campaign sends you an offer for one type of clothing when you really prefer another, that’s not a big issue.

The best way to plan your approach to testing is to talk to the potential users of the tool before you start building it. These are your customers, so it behooves you to get to know them a bit. Find out what their attitude is toward adopting new tools. Maybe you can identify some “lead testers” who are willing to use an early prototype even if it has a few bugs or inaccuracies. Maybe you can find people who are comfortable using a less-than-fully-polished UI. A little bit of time spent in these conversations can save you a lot of time later on.

Think First, Then Plan Accordingly

As we have seen time and time again, the above points of failure can derail a data science project before one line of code is written. By being aware of these points and testing each of your projects against them, you can be much more certain that your project starts out with a well-considered destination in mind. These six points, along with the checklist at the end of this article, can help you assess how prepared you really are to initiate a data project.

As we suggested at the beginning of this article, it is tempting to assign almost supernatural powers to data scientists. The results some data models have been able to achieve do, in many cases, seem almost magical. But there is more to data & analytics than just, well… data & analytics.

Even the best data scientists need to work closely with business users to make sure that what they imagine as the best data solution delivers clearly defined, measurable business value to the end users. If it doesn’t, what was the point of the project? You wouldn’t write software or build a business without working closely with all stakeholders to create a clear plan. Why would you approach a data science project any differently?

For Reference: A Checklist before you begin

1. Business value: Which KPIs will the project change, and what is that change worth?

2. Business process impact: Which specific decisions will the results change, and can the decision makers act on them?

3. Data: Is the data available, of sufficient quality, well governed, and updated often enough?

4. Analytical approach: What is the objective function, and where does the model actually need to be accurate?

5. Team: Do the team’s skills and experience match the problem?

6. Piloting and testing: How will the model be tested, and how fast is the feedback loop?
