Decision Quality for Data Scientists: Lessons from Decision Professionals

Six principles that organize what I’ve learned over the years building models.

John Mark Agosta
Decision Analysis
16 min read · May 4, 2020


Photo credit: Van der Vat, M.P., J.E. Hunink, D. Stuparu, Climate Change Risk Analysis for Projects in Kenya and Nepal.

Decision Quality systematizes the best practices for modeling decisions, which are practices that the field of Data Science sorely needs to adopt.¹

Introduction

Let’s say you are a Data Scientist at your first meeting with your client. Or you are a software engineering manager or a program manager thinking of starting a project where you can apply what you know of Machine Learning (ML) or Artificial Intelligence (AI). Perhaps analysts on your team suggested a project based on their preliminary data analysis and the promise of almost unlimited data. Now people are looking to you to know where to start, how to model the situation, and how to apply automation that achieves business value.

Where do you start? What questions do you ask? Bringing together the analysis of business decision making and Data Science, this article draws on principles that have been shown to improve Decision Quality in practice. Decision Quality codifies the experience of decision professionals who advise on decisions that place companies at risk, involve millions of dollars, and span multiple years. It is based on principles derived from Decision Theory, a body of techniques to quantify and improve decision making. This article extends the Decision Quality framework to Data Science practice by focusing on the decisions to be automated.

Data Science should start by identifying decisions, one of the six aspects implemented by the Decision Quality framework.

Current Data Science has demonstrated notable successes driven by data-rich online services for recommendation, search, and the like. And this is just the beginning. But extending Data Science to new applications is hampered by an incomplete theory. As the name of the field implies, the focus has been to chase after sources of data, with the presumption that past practice forms an adequate precedent for new applications. The common myth that the Data Science process starts with the data is a non-starter. It starts instead with identifying the decision, one of the six aspects implemented by the Decision Quality framework.

There are two ways in which decision-making principles can be applied to Data Science. The first is deciding on a project to undertake an automated application: “What is the automated application to build?” The other is incorporating Decision Quality principles within an automated software system. Automated decisions occur numerous times, at “internet” speed, and that is the theme of this article, and what Data Science should really be about.

Beginning with The Essence of a Decision

So back to day-zero when you’re standing in front of a blank whiteboard with your client.

Example of group facilitation for framing decisions. (Image source: https://en.wikipedia.org/wiki/Decision_tree)

One may presume that there’s a data-rich problem to be automated, but before one starts working on a data set, or even searches out available data, framing the problem should begin by examining the set of choices or possible actions comprising the “essence of the decision.”² A clearly specified decision leads to a course of action that drives model design and implementation.

What is a decision?

A decision implies

1. A tangible change — something moved or modified; a resource committed,

2. At a point in time,

3. By a decision-maker, typically identified with a person,

4. That can be defined by a set of uncertainties and consequences.

A decision’s choices form an “either-or,” and when resources are allocated, reversing the decision would incur some cost. Starting with the decision, as opposed to, say, the prediction, clarifies the analysis, as we shall see. We concentrate on what economists call “single-person decision theory,” which implies that trade-offs need to be made based on resource constraints, uncertain outcomes, and values. We don’t consider the extension where the decision-maker faces an adversary, which becomes instead an application of game theory.
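To make this concrete, here is a minimal sketch in Python of single-person decision theory at work: each alternative has uncertain outcomes with associated values, and the recommended choice is the one that maximizes expected value. The alternatives, probabilities, and payoffs below are hypothetical, chosen purely for illustration.

```python
# A minimal, hypothetical single-person decision: choose the alternative
# with the highest expected value, given probabilities over uncertain
# outcomes and the value (payoff) of each outcome.

alternatives = {
    # alternative: list of (probability, payoff) pairs over its possible outcomes
    "launch_now":       [(0.6, 100_000), (0.4, -50_000)],
    "wait_one_quarter": [(0.8, 60_000), (0.2, -10_000)],
    "do_nothing":       [(1.0, 0)],
}

def expected_value(outcomes):
    return sum(p * payoff for p, payoff in outcomes)

for name, outcomes in alternatives.items():
    print(f"{name:18s} expected value = {expected_value(outcomes):8,.0f}")

best = max(alternatives, key=lambda a: expected_value(alternatives[a]))
print("Recommended alternative:", best)
```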

Decisions often gain attention when there is a fear that something has gone, or is about to go, wrong. Better yet, considering circumstances creatively may suggest alternatives before a crisis arises. The framework presented here can be applied to any decision, but it only makes sense to invest time and effort in analysis when the consequences justify the automation effort.

Aspects of formulating a decision and what pitfalls they avoid

Succinctly, Decision Quality means applying these principles:

1. Solve the right problem. Frame a scope that encompasses the problem fully. Don’t confuse analysis of the data-set with analysis of the actual situation.

2. Consider a true range of alternatives. For instance, a model intended just to advocate for certain outcomes will overlook valid alternatives and be subject to bias.

3. Search out curated and relevant information, not only as data, but including the proper use of judgment combined with data.

4. Value consequences of decision outcomes, making trade-offs explicit, bounded by ethical constraints. Incomplete optimization of trade-offs can lead to models with perverse, unintentional consequences.

5. Apply solid reasoning. Use methods that de-bias information. Similarly, be able to measure “signal” versus “noise” so that consequences can be evaluated based on probabilities of outcomes. Models that optimize predictive accuracy and ignore probability of error can lead to wrong choices.

6. Implement what has been decided. Entering into analysis without the means and commitment to follow through doesn’t achieve anything.

From C. Spetzler et al. (2016), Decision Quality.

How things go wrong

Of course, all of this sounds obvious! How could one deny any one of these six principles? For the longest time, I have applied them implicitly, only to be greeted by the uncomprehending looks of my fellow team members. The benefit of a shared framework is to foster a common language and get everyone on the same track.

The principles of Decision Quality, by bringing decision theory to practice, reveal flaws in decision making in the large and the small. The pressures that impede successful software automation are the same ones that underlie failures in major organizational decisions. We can learn from how well-intentioned, “rational”³ persons managed to make bad decisions on a global scale — take the case of the Cuban Missile Crisis. It began when the U.S. discovered intermediate-range ballistic missile sites in Cuba, just off the U.S. coast. Both the young, inexperienced U.S. President and the Soviet military bureaucracy blundered along, from one bad decision to another, each fearing the other would start an unintentional nuclear war. The Soviets had wanted to address an imbalance they perceived in the “missile gap.” The U.S. response to Cuba obtaining armaments that threatened the U.S. was to run a naval blockade. The Soviets, justifiably, saw the blockade as an act of war.

Organizational theory explains why the Cuban Missile Crisis evolved in ways that were not rational.

Why did both sides act in ways that were obviously inconsistent with their ultimate desire to contain the conflict? Both sides’ leadership suffered from military organizations operating “by the book,” avoiding short-term organizational risks while raising the risk of the situation spinning out of control. Organizational theory explains why the Crisis evolved in ways that were not rational. Soviet intelligence, typically expert at covert action, unwittingly delegated construction of the missile installations, resulting in sites that were visible and easily identifiable from overflights. Subsequently, the U.S. Navy ran the blockade on the high seas rather than in coastal waters, preserving its tactical advantage but aggravating the risk of inciting the escalating military confrontation that the U.S. leadership desperately wanted to avoid. In both cases, there was a complete disconnect between desired outcomes and actions, poor use of intelligence, poor consideration of viable alternatives, and implementation of inappropriate policies.

For thirteen days both sides remained at a standoff, desperately hoping to avoid escalation, until, in short, both sides made concessions and backtracked from their courses of action. The Crisis is studied as a textbook case of organizational decision making. For those curious, a detailed analysis is offered by Essence of Decision [Allison, 1999], which uses the Crisis to illustrate different models of decision making. Without going into too much detail here, Allison starts with the same rational model, borrowing from decision theory — whose origins are identical to the basis for Decision Quality. Allison next develops an organizational theory model, which explores the limitations of organizations as they “satisfice”³ by reacting to circumstances with a standard repertoire of operating procedures. The bounds on rationality⁴ that cause organizations to fail are in principle the same as those that occur in complex automated systems built in software.

Automating Decisions with Software

The question is how to make the automated application more rational — to incorporate the Principles into automation itself.

We can draw on the lessons of organizational decision-making to reveal strong parallels with the challenges of automating decisions that plague Data Science. At one level, the implementation of Data Science projects can be analyzed as just one more example of an organizational process: do I incorporate such-and-such an automated application into the business or not? This implies applying Decision Quality to business choices just as it is applied in conventional areas. At a different level, the more interesting question for Data Science is to approach intelligent software as a problem of how to automate decisions. “Intelligence,” specifically in the sense of Artificial Intelligence, has the goal of acting rationally⁵. Hence better Decision Quality guides one in creating more intelligent applications. Fortunately for the sake of this analysis, decision-making and its foibles are universal. We can make some initial observations about what organizational failures imply for decision-making failures in general, and how they might apply to decisions made by software systems.

1. Just as decisions made in organizations are subject to time and resource constraints, software systems are constrained by inadequate alternatives, information, and lack of consideration of decision consequences.

2. Systems, similar to individuals, fall into patterns of re-using techniques (think off-the-shelf software algorithms) whose properties are well known rather than exploring the problem fully for an appropriate solution.

3. The consequence of bounded rationality is that systems are designed to down-play uncertainty, limit search, and sub-optimize toward immediate objectives.

In short, “premature optimization”⁶ and improper risk avoidance are endemic in Data Science applications as well. Such failings are not new to system design; however, the opportunities Data Science creates for more intelligent automation open a new field of application.

Object Lessons from “The Wild”

The following case studies, drawn from data science projects applied to business problems in my professional life, illustrate where questions of Decision Quality arose and, had they been raised, would have made a difference.

Object Lesson 1: “On Time and In Full.” Is that the right problem?

As a Data Science team, we were engaged by a C-level executive at a goods manufacturing firm. He had raised concerns about one of his remote manufacturing plant’s ability to deliver against orders on-time and in full (OTIF), as indicated by data collected in the plant’s MRP system — or so we thought. His request — the model framing we adopted — was to see how well we could use state-of-the-art ML tools to predict OTIF based on the orders received. Investigation revealed that the accuracy of the OTIF numbers predicted by the model stood at a modest 60%. In part this was due to the negligible variation in the dependent OTIF variable; there was hardly any effect to predict. This also was because the prediction was made on features available only up to the time the order was received, and much of the variability in observed OTIF was due to variation in the manufacturing process after receipt of the order.

Further investigation showed that OTIF variation was more often negative than positive; that is, goods often were delivered before the promised order delivery date! As a consequence of manufacturing’s desire to avoid the risk of missing deadlines, manufacturing runs were scheduled immediately on receipt of the order, then goods kept in inventory until needed. Consequently, inventories grew out of control, taxing storage capacity. Put into perspective, this overemphasis on OTIF prediction while overlooking inventory holding costs obscured the actual decision trade-off. Fortunately once revealed, this finding was welcomed by the client.

This case highlights the lack of application of several Decision Quality principles. When swimming through large quantities of data, it is important to “reach bottom” and discover how the data is grounded in actual physical and business processes. Had we scrutinized the entire manufacturing process initially, we might have begun instead by (1) solving the right problem. This would have been apparent had we canvassed the organization to understand the (4) values attached to the consequences of the manufacturing process. Then we could have traced consequences back to the relevant decisions: in principle, outcomes point to the decisions that drive them, and hence those decisions must be included in the model. In this case, the (2) true decision alternatives concern when to schedule production, which determines the trade-off between holding inventory and meeting OTIF deadlines.

It is likely that, had we instead learned a model that captured the trade-off between inventory holding and OTIF, we could have used inventory holding times and quantities — the (3) relevant information given this decision — to produce a model that was both more accurate and more relevant.
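As a rough illustration of what such a trade-off model might look like, the sketch below simulates scheduling a production run some number of days before the promised delivery date, balancing a per-day inventory holding cost against a penalty for missing the OTIF deadline. All costs, distributions, and parameters are hypothetical and stand in for quantities that would be estimated from the plant’s data.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_cost(lead_time_days,
                  n_sims=10_000,
                  holding_cost_per_day=20.0,    # hypothetical cost of storing one finished order per day
                  late_penalty=5_000.0,         # hypothetical cost of missing the OTIF deadline
                  prod_mean=10.0, prod_sd=3.0): # hypothetical production-time distribution (days)
    """Monte Carlo estimate of the cost of starting production
    `lead_time_days` before the promised delivery date."""
    production_time = rng.normal(prod_mean, prod_sd, n_sims)
    finish_margin = lead_time_days - production_time        # days finished ahead of the due date
    holding = np.maximum(finish_margin, 0) * holding_cost_per_day
    late = (finish_margin < 0) * late_penalty
    return float(np.mean(holding + late))

# Sweep the scheduling lead time to expose the trade-off between
# inventory holding cost and the risk of missing OTIF.
for lead in range(8, 25, 2):
    print(f"lead time {lead:2d} days -> expected cost {expected_cost(lead):8.0f}")
```

The lead time that minimizes this curve is the scheduling policy a decision-focused analysis would recommend, in contrast to minimizing lateness alone.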

Any problem worth analysis at its core involves evaluating trade-offs, so —

(1) solving the right problem means including the consequences of trade-offs. Furthermore, since inaccuracy is part of the nature of any statistical prediction, quantifying the risks due to inaccuracies determines what decisions should be made. The risks determine the trade-offs. A solitary focus on accuracy is not an application of (5) apply solid reasoning, since it overlooks the probability of the outcomes that will inevitably be predicted inaccurately.

A confusion matrix tabulates the probabilities of predicted versus actual outcomes. Cases where predicted and actual correspond fall on the diagonal and the count of these entries indicates the model accuracy. Off-diagonal entries indicate errors. The confusion matrix for a binary prediction has four outcomes, like this:

A confusion matrix shows the four outcomes of a binary prediction:

                        Actual positive         Actual negative
Predicted positive      True positive (TP)      False positive (FP)
Predicted negative      False negative (FN)     True negative (TN)

This simple comparison shows the difference between accuracy and a risk-based value function as an objective. Consider evaluating a binary classifier’s confusion matrix’s four entries, “true positive” (TP), “true negative” (TN), “false positive” (FP), and “false negative” (FN). As mentioned, accuracy counts only the true elements and ignores the errors. Assuming correct predictions incur zero cost, properly one should evaluate value, v, by assigning costs specific to the two error terms:

v = cs · FP + cm · FN

where FP is the count of spurious predictions (false positives), FN is the count of missing predictions (false negatives), cs is the unit cost of a spurious prediction, and cm is the unit cost of a missed prediction. Note in the OTIF case how the costs differ between missing one order delivery and having to store extra stock to assure the delivery is met.
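A short sketch, with hypothetical counts and unit costs, shows how this cost-weighted objective can rank two models differently even when their accuracy is identical:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def error_cost(fp, fn, c_spurious, c_missed):
    # v = cs * FP + cm * FN, assuming correct predictions incur zero cost
    return c_spurious * fp + c_missed * fn

# Two hypothetical models with the same accuracy but different mixes of errors.
models = {
    "model_A": dict(tp=80, tn=80, fp=30, fn=10),
    "model_B": dict(tp=80, tn=80, fp=10, fn=30),
}

# Hypothetical unit costs: a missed prediction (say, a late order) costs far
# more than a spurious one (say, extra stock held to cover it).
c_spurious, c_missed = 1.0, 10.0

for name, m in models.items():
    print(f"{name}: accuracy = {accuracy(**m):.2f}, "
          f"error cost = {error_cost(m['fp'], m['fn'], c_spurious, c_missed):.0f}")
```

Both models score 0.80 on accuracy, yet model A incurs less than half of model B’s error cost, which is the comparison that actually matters for the decision.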

This example also highlights the distinction between an operational Key Performance Indicator (KPI), in this case, OTIF, and the eventual business goals that a model needs to incorporate. So pursuing OTIF as the dependent variable was “barking up the wrong tree.”

In conclusion, a thorough business analysis — direct discussions with the parties involved — could have set a proper scope before commencing predictive modeling. The consolation is that the client gained valuable insights from the analysis; still, had all the facts been available at the outset of the work and had the cited Decision Quality principles been applied at that time, the work might have resulted in a better model.

Object Lesson 2: Sales Opportunity Propensities: What to do next?

“Opportunity Scoring” applies an ML classifier (think logistic regression) that assigns a score to predict closing a sales opportunity. It promiscuously incorporates features of the deal (dollar amount, product, etc.), the customer, the sales team, the business climate, and anything else that improves accuracy. These opportunities are characteristically large deals for enterprise customers, involving the efforts of the sales team over the course of months to sometimes years. In short, a score near one implies an opportunity likely to be won, while a score near zero implies a likely lost sale.
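As a sketch of what such a scorer typically looks like in code (the column names and records are hypothetical, and any probabilistic classifier could stand in for logistic regression):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical opportunity records: deal, customer, and product features plus the won/lost label.
df = pd.DataFrame({
    "deal_size_kusd":   [50, 250, 10, 500, 75, 120],
    "product":          ["cloud", "cloud", "support", "cloud", "support", "cloud"],
    "customer_segment": ["enterprise", "enterprise", "smb", "enterprise", "smb", "enterprise"],
    "won":              [1, 0, 1, 0, 1, 1],
})

X, y = df.drop(columns="won"), df["won"]
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["product", "customer_segment"]),
    remainder="passthrough",
)
scorer = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)
scorer.fit(X_train, y_train)

# The "opportunity score" is the predicted probability of winning each held-out deal.
print(scorer.predict_proba(X_test)[:, 1])
```

In practice these models are trained on thousands of historical opportunities; the point here is only the shape of the pipeline, not the data.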

I’ve been a member of Data Science projects attempting quarter-to-quarter sales Opportunity Scoring that achieved remarkable accuracy across the sales portfolios of large multinational companies. The data covers thousands of opportunities using hundreds of features. Despite the claims of accuracy and a comprehensive approach, the usefulness of these models is suspect, since no attention was paid to how the prediction applies to the actions of the sales team. Properly, an appeal to Decision Quality starts by asking what actions — (2) consider a true range of alternatives — should be modeled.

In one case, almost all the “opportunities” considered were routine subscription renewals that are most often extended by the customer with little effort by the sales team. The high overall accuracy of renewal predictions, compared to new sales, obscures the predictive accuracy on the smaller number of new accounts, which attract the majority of the sales team’s efforts and where most of the model’s value lies. By not breaking out new sales separately, most of the model’s usefulness is lost.
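A segment-level check of the kind that would have exposed this takes only a few lines; the labels and predictions below are hypothetical:

```python
import pandas as pd

# Hypothetical evaluation set: mostly easy renewals, a handful of hard new sales.
eval_df = pd.DataFrame({
    "segment":   ["renewal"] * 8 + ["new_sale"] * 4,
    "actual":    [1, 1, 1, 1, 1, 1, 1, 0,  1, 0, 0, 0],
    "predicted": [1, 1, 1, 1, 1, 1, 1, 1,  1, 1, 1, 0],
})

overall = (eval_df["actual"] == eval_df["predicted"]).mean()
by_segment = (eval_df.assign(correct=eval_df["actual"] == eval_df["predicted"])
                     .groupby("segment")["correct"].mean())

print(f"overall accuracy: {overall:.2f}")  # looks respectable ...
print(by_segment)                          # ... but the new-sale segment, where the value is, fares poorly
```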

But speaking of usefulness, more importantly, the idea of Opportunity Scoring fundamentally confounds prediction of the modeled outcome, in this case the binary sales outcome “succeeds or fails,” with informing the available alternatives. A useful model’s predictions should indicate which choice leads to the best expected outcome, rather than just predicting the “highest-scoring” outcome based on recent history. Is it best that the sales team contact the client in person, or by email, at this time? Should they engage their legal or IT resources to advance the deal? And as for (1) proper framing of the problem, this implies also modeling temporal aspects. For example, depending on the status of a sale, it may be better to postpone efforts to close the deal until next quarter, so as not to eat one’s seed corn, and instead to focus on those opportunities near completion. Predictions not predicated on conditions specific to the sale are pointless: if the model predicts a high opportunity score, does that imply less effort is necessary, since the sale would close anyway, or does it imply the need for more effort, to “make the prediction come true”?
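To contrast a bare score with a decision-aware use of the model, here is a sketch in which a stand-in model is asked for the win probability conditional on each candidate action, and the recommendation is the action with the highest expected net value. The action set, costs, deal value, and conditional probabilities are all assumptions made for illustration.

```python
# Hypothetical candidate actions the sales team could take on one opportunity,
# each with an estimated cost of the effort involved.
action_costs = {
    "email_followup":           200.0,
    "in_person_visit":          2_000.0,
    "engage_legal_resources":   5_000.0,
    "postpone_to_next_quarter": 0.0,
}

deal_value = 100_000.0  # hypothetical value of winning the deal

def win_probability(opportunity_features, action):
    """Stand-in for a model of P(win | features, action). In practice this
    would be a trained, action-conditional classifier; here it is a made-up
    lookup, for illustration only."""
    base = {"email_followup": 0.30, "in_person_visit": 0.45,
            "engage_legal_resources": 0.50, "postpone_to_next_quarter": 0.25}
    return base[action]

def expected_net_value(features, action):
    return win_probability(features, action) * deal_value - action_costs[action]

features = {}  # placeholder for this opportunity's features
for action in action_costs:
    print(f"{action:26s} expected net value = {expected_net_value(features, action):10,.0f}")

best_action = max(action_costs, key=lambda a: expected_net_value(features, a))
print("Recommended action:", best_action)
```

The output ranks actions, not outcomes: the recommendation changes as costs, deal value, or conditional win probabilities change, which is exactly the sensitivity a bare score cannot express.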

Considering the right alternatives makes it possible to identify what qualifies as (3) curated and relevant information, i.e., model features for these alternatives. With the huge number of available features that one can conceivably add to one’s data set, one should be cautious about including those that empirically improve prediction but for which no argument of relevance can be made. The hazard is that a prediction’s accuracy may be driven by a spurious correlation that is not guaranteed to persist when the model is put into use. This is roughly what happened with the Sales Opportunity model: scores were best predicted by features describing the makeup of the sales team assigned to the deal. When model predictions subsequently failed in later quarters, it was evident that the sales-team variable was a surrogate for the strategy behind that team’s sales quota setting. The efficacy of teams depended on how they set their sales quotas. That did not appear in the model, but it caused quarter-to-quarter discrepancies, and the model failed to generalize.
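The failure mode described here, a proxy feature that predicts well in the training quarter and collapses when its hidden driver changes, can be reproduced in a few lines. The simulation below is hypothetical and illustrates only the mechanism; it does not use the engagement’s data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_quarter(n, quota_strategy_by_team):
    """Hypothetical data generator: the true driver of winning is each team's
    (unobserved) quota-setting strategy; the team id is only a proxy for it."""
    team = rng.integers(0, 4, size=n)
    aggressiveness = np.array([quota_strategy_by_team[t] for t in team])
    win = (rng.random(n) < 0.3 + 0.4 * aggressiveness).astype(int)
    X = pd.get_dummies(pd.Series(team, name="team"), prefix="team")
    return X, win

# Quarter 1: team identity happens to line up with quota strategy.
X_q1, y_q1 = make_quarter(2_000, {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0})
# Quarter 2: quota strategies are reshuffled, and the proxy breaks.
X_q2, y_q2 = make_quarter(2_000, {0: 0.0, 1: 0.0, 2: 1.0, 3: 1.0})

model = LogisticRegression().fit(X_q1, y_q1)
print("Q1 (training quarter) accuracy:", round(model.score(X_q1, y_q1), 2))
print("Q2 (next quarter) accuracy:   ", round(model.score(X_q2, y_q2), 2))
```

Accuracy in the training quarter looks fine because the team id happens to proxy the quota strategy; once the strategies shift, the same model does worse than chance.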

A Final word about Implementation

We haven’t said much about commitment to making the right choice: (6) implement what has been decided. This is where automated decisions differ from human-implemented decisions since, clearly, an automated decision is not in a person’s hands. Automation raises engineering challenges that often require orders of magnitude more effort than the Data Science modeling they implement. Building an “end-to-end” system is where most of the work occurs. The determinants of engineering success are out of scope for this article, but caveats apply. Ethical concerns arise when a decision that should take human judgment into account is automated; this is covered by (4) properly valuing consequences.

Do we need a new term for “Data Science”?

Armed with this new set of principles, we can carve out a new approach to the design and formulation of data-driven automation. Borrowing from the terms “Artificial Intelligence” and “Decision Analysis” suggests Decision Intelligence.⁷ I’m using the term specifically to mean incorporating Decision Theory into the practice of Data Science, to create intelligent data-driven systems. Just as the decision to automate is subject to Decision Analysis, the same fundamental principles apply to decisions incorporated in data-driven automation. The arguments I’ve given here will sound old-hat to my Decision Analysis colleagues, albeit dressed up in new terminology. In the end, the result is a comprehensive method based on Decision Quality for Data Science to advance, mature, and broaden as a field.

Endnotes

[1] Thanks to discussions at DAAG 2019, and various teams at Microsoft: Particular acknowledgments are due to Dave Matheson, Brad Powley, Somik Raha, Tomas Singliar, and Carl Spetzler, for contributions, valuable comments, and guidance. The inspiration for this article comes from C. Spetzler, H. Winter, J. Meyer. (2016). Decision Quality: Value Creation from Better Business Decisions, NJ: Wiley.

[2] We borrow the term from the similarly titled book, G. Allison and P. Zelikow (1999), Essence of Decision, 2nd Ed., NY: Pearson, on theories of decision making in public policy. Some may find it far-fetched to link this to our current automation concerns; however, it shows how universal these principles are.

[3] Bounded rationality results in “satisficing.” The term satisficing, coined by H. Simon in the 1950s, implies finding a satisfactory, if non-optimal, solution by a decision maker with limited resources and an aversion to risk, subject to the constraints of a bureaucratic hierarchy. Satisficing is thus a consequence of bounded rationality. As for how it applies to AI, I recommend looking at this recently re-issued version of his seminal text: H. A. Simon (2019), The Sciences of the Artificial, MA: MIT Press.

[4] To quote H. Simon: “[Bounded] rationality denotes a style of behavior that is appropriate to the achievement of given goals, within the limits imposed by given conditions and constraints.” He goes on to explain why bounded rationality applies equally to the behavior of individuals and of organizations. See H. Simon (1972), “Theories of Bounded Rationality,” Ch. 8 in C.B. McGuire & R. Radner (Eds.), Decision and Organization, North-Holland Publishing Co.

[5] Fortunately there is a contingent in the AI community that advances this approach. See Russell, S., & Wefald, E. (1991), Do the Right Thing: Studies in Limited Rationality, MA: MIT Press.

[6] Attributed to D. Knuth: “Premature optimization is the root of all evil.” The sense of the aphorism is that one should avoid detailed performance improvements until those that best contribute to the entire system operation become clear.

[7] A nod to Cassie Kozyrkov for proposing the term “Decision Intelligence.” We could riff off the term “Data Science” and call it “Decision Science,” but that term is not new: it has been used to refer to a kind of policy analysis, as in government response to climate change. See for example: R. J. Lempert (2002), “A new decision sciences for complex systems,” Proceedings of the National Academy of Sciences, 99 (suppl 3), 7309–7313. https://doi.org/10.1073/pnas.082081699.
