Making Durable Decisions

Jul 28, 2014

In an engineering organization, making durable decisions is a constant task. Before executing a project, it’s incredibly valuable to make the macro-decisions and agree upon them. The human factor is also a major aspect of making decisions. Lastly, for those of us who dabble in the systems and networking side of the house, a constant battle is corruption. The question is how to do all of this without reducing engineering agility. Hopefully, the tools below will help you make good decisions.

Making decisions is hard. Often, it’s difficult to generate a cohesive big picture. Donald Rumsfeld puts this well:

Reports that say there’s — that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know.

Define the problem

Engineers often get caught up in the technical problem at hand, but at the end of the day, engineers are here to create business value. Coda Hale has an excellent talk related to this called “The Programming Ape.” For example, our problem might be to charge credit cards.

The Scene

A small start-up is about to begin monetizing its first products with credit-card payments. There are maybe 10,000 payments a day, each of about $100. Taking someone’s money without giving them a product is simply unacceptable. Taking someone’s money multiple times would be a PR disaster at these volumes. The start-up’s competitors are close behind, and failing to service the product flow would result in lost customers, so deferred revenue recognition, and even some lost revenue, is preferred over perfect revenue recognition. Lastly, everyone knows that delay matters.
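
To put rough numbers on the scene, here is a back-of-the-envelope sketch. The payment count and average amount come from the paragraph above; the even spread of payments through the day is my own simplifying assumption, and real traffic is burstier.

```python
# Figures taken from the scene above.
PAYMENTS_PER_DAY = 10_000
AVERAGE_PAYMENT_USD = 100

SECONDS_PER_DAY = 24 * 60 * 60

daily_volume_usd = PAYMENTS_PER_DAY * AVERAGE_PAYMENT_USD
seconds_between_payments = SECONDS_PER_DAY / PAYMENTS_PER_DAY

# Assumption: payments arrive evenly through the day (real traffic is bursty).
payments_per_hour = PAYMENTS_PER_DAY / 24
usd_per_hour = daily_volume_usd / 24

print(f"daily volume:           ${daily_volume_usd:,}")
print(f"one payment every:      {seconds_between_payments:.2f} s")
print(f"payments per hour:      {payments_per_hour:.0f}")
print(f"volume at risk per hour: ${usd_per_hour:,.0f}")
```

The request rate is tiny, so raw throughput is unlikely to be the hard part. The money flowing through each hour is what makes losing or duplicating work expensive.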

What does charging credit cards involve? It’s actually a ridiculously complicated industry. Let’s say we don’t want to deal with these complexities, and we want a solution that enables our developers to get a product to market quickly. I suppose the easiest solution is to go with someone like Stripe or Amazon (ideally that decision would be made the same way). For a moment, let’s reason about the complexities at hand:

Stripe, and the network between the two entities, are not 100% available: https://status.stripe.com/.
Calling out to an external service means that the latency the user experiences is going to be suboptimal.
Losing credit card charges means lots of potential lost revenue.

So, now the rest of the yak-shaving begins — beyond the actual credit card processor, we need to reason about a queue, background worker tool, and monitoring. For the sake of this discussion, let’s focus on reliable queues.
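
The shape of that arrangement can be sketched in a few lines. This is only an illustration under stated assumptions: `FakeProcessor`, `ChargeRequest`, `accept_payment`, and `drain` are hypothetical names, the processor is assumed to deduplicate on a caller-supplied key, and the in-memory `queue.Queue` stands in for a reliable queue (it is not durable, which is exactly why the choice of a real one matters).

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class ChargeRequest:
    customer_id: str
    amount_cents: int
    # Hypothetical idempotency key: lets the processor ignore a repeated request.
    idempotency_key: str = field(default_factory=lambda: uuid.uuid4().hex)


class FakeProcessor:
    """Stand-in for an external card processor; not a real API."""

    def __init__(self, fail_first_n=0):
        self.fail_remaining = fail_first_n
        self.charges = {}  # idempotency_key -> amount_cents

    def charge(self, request):
        if self.fail_remaining > 0:
            self.fail_remaining -= 1
            raise ConnectionError("processor unavailable")
        # Recording by key means a retried request cannot charge twice.
        self.charges.setdefault(request.idempotency_key, request.amount_cents)


def accept_payment(pending, customer_id, amount_cents):
    """Fast path for the user: enqueue and return without calling out."""
    request = ChargeRequest(customer_id, amount_cents)
    pending.put(request)
    return request.idempotency_key


def drain(pending, processor, max_attempts=3):
    """Background worker: charge each queued request, retrying on failure."""
    failed = []
    while not pending.empty():
        request = pending.get()
        for _ in range(max_attempts):
            try:
                processor.charge(request)
                break
            except ConnectionError:
                continue
        else:
            # Every attempt failed: hand the request to monitoring, never drop it.
            failed.append(request)
    return failed


pending = queue.Queue()
processor = FakeProcessor(fail_first_n=2)  # simulate two outages in a row
key = accept_payment(pending, "cust_1", 10_000)
failed = drain(pending, processor)
print(processor.charges, failed)
```

The user-facing call only touches the queue, so its latency does not depend on the processor being up. The worker absorbs the outages, and the key keeps a retry from turning into a second charge.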

Define a Fitness Function

In today’s world, we don’t have unlimited resources to build the perfect queue. We could spend all the time in the world theorizing, but decision making in the real world involves far too many overlapping possibilities for an average person to reason about. If we had all the time in the world, we could implement every possible solution and test each independently to decide which is best, but that’s left up to researchers.

In addition to this, it’s difficult to know everything that matters in a solution. The things we already know matter are the known knowns, and they can be used to easily remove solutions from the competition. We can take the business values above and turn them into a few concrete metrics.

The Scene

Reliability: The system must not lose data during enqueues. We need to try as hard as possible to have exactly-once, or at-most-once, queue operations.
Ease-of-use: Our start-up has a limited number of engineering resources, so we would prefer to have a simple solution.
Cost: Our start-up faces the existential risk of failing because our product fails, not because we’re not making money. It’s better to have a product than nothing at all. That said, we can’t give it away for free, because we all know the cost of free.
Performance: Credit card payments already operate in the area of seconds. The enqueue operation should be done in tens of milliseconds, but the processing time can easily be in the tens of seconds.
Operations: Your system (ideally) will spend more time being operational sans developers than it does in the hands of an engineer. Engineers are expensive. For this reason, we want a system that involves as little operational cost as possible.
Security: Do not expose ourselves to undue security risk.
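
One way to make a list like this usable is to split it into hard limits, which eliminate a candidate outright, and weighted ratings, which rank whatever survives. The sketch below is only that, a sketch: the weights, the 1 to 5 ratings, the 100 ms reading of “tens of milliseconds”, and the candidates `queue_a`, `queue_b`, and `queue_c` are all invented for illustration.

```python
# Hypothetical weights: how much each criterion from the list above matters.
WEIGHTS = {
    "reliability": 5,
    "ease_of_use": 3,
    "cost": 1,
    "performance": 2,
    "operations": 4,
    "security": 4,
}

# Assumed reading of "tens of milliseconds" as a hard upper limit.
MAX_ENQUEUE_MS = 100


def is_eligible(candidate):
    """Known knowns act as hard limits that remove a candidate outright."""
    return (not candidate["loses_data_on_enqueue"]
            and candidate["enqueue_ms"] < MAX_ENQUEUE_MS)


def fitness(candidate):
    """Weighted sum of 1-5 ratings, one rating per criterion."""
    return sum(WEIGHTS[name] * candidate["ratings"][name] for name in WEIGHTS)


def rank(candidates):
    """Drop ineligible candidates, then order the rest from best to worst."""
    eligible = [c for c in candidates if is_eligible(c)]
    return sorted(eligible, key=fitness, reverse=True)


# Invented candidates with invented ratings.
candidates = [
    {"name": "queue_a", "loses_data_on_enqueue": False, "enqueue_ms": 15,
     "ratings": {"reliability": 5, "ease_of_use": 2, "cost": 3,
                 "performance": 4, "operations": 2, "security": 4}},
    {"name": "queue_b", "loses_data_on_enqueue": False, "enqueue_ms": 40,
     "ratings": {"reliability": 4, "ease_of_use": 4, "cost": 4,
                 "performance": 3, "operations": 4, "security": 4}},
    {"name": "queue_c", "loses_data_on_enqueue": True, "enqueue_ms": 5,
     "ratings": {"reliability": 2, "ease_of_use": 5, "cost": 5,
                 "performance": 5, "operations": 5, "security": 3}},
]

for candidate in rank(candidates):
    print(candidate["name"], fitness(candidate))
```

The exact numbers matter less than the fact that they are written down. Anyone who disagrees with the ranking has to argue with a weight or a rating, not with a feeling.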

Determine The Contenders

This is where it’s valuable to have a diverse engineering team, and at least two diverse thinkers on a project. At this point, you objectively collect data on all possible choices. You focus primarily on the portions of the solution that apply to the fitness function, but it’s valuable to spend 1-2 weeks looking at various solutions and doing research. This exercise involves gathering information in an objective manner, as well as gathering war stories and anecdotes from the community at large. Be careful when considering anecdotes, because the landscape can change quickly, and perceptions are skewed. Other industries (medicine, news, etc.) have learned their lesson, yet somehow our industry puts great weight on war stories and heroes. The overall purpose of this exercise is to address the known unknowns.

Prototype

At this point it’s also valuable to spend a time-capped period on a prototype. Rarely will 1-2 weeks of delay hurt you more than having to reverse a decision 2 months into a project. The upfront opportunity cost of each prototype should be low enough that the exercise is valuable. The cost of a failed project greatly outweighs the sunk cost of prototyping. It is almost always better to act than to get stuck in analysis paralysis, and to avoid this you should time-cap your research phase to something reasonable. At this point, you should have a fairly good idea of the unknown unknowns.

Act

There is nothing left to do beyond this point, other than act. In a reasonable amount of time, given the data you gathered, the solution should quickly become obvious, and if it doesn’t, a suboptimal solution probably won’t tank your company. The truth is that it probably doesn’t matter what you choose, because any of the top solutions will work for you.

What does success look like?

Another valuable portion of this exercise is determining what success looks like. Earlier we stated that this start-up has limited resources. It’s valuable to determine the success criteria early on, because work will fill the time allotted (see: Parkinson’s Law). Before the actual act, make sure you have an obvious set of criteria that you can assess the solution against before saying you’re done. If you set your criterion at “fast”, that might leave too much to the judgement of your engineering team, but if you set it at “must not generate greater than a 100 ms perception of delay to the end user”, that’s much easier to evaluate against.
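
A criterion phrased that way can be checked mechanically. Here is a minimal sketch, assuming the delay is judged at the 99th percentile of measured latencies; the percentile choice and the function names are my assumptions, and only the 100 ms limit comes from the text.

```python
import math

MAX_PERCEIVED_DELAY_MS = 100  # the limit from the criterion above


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_criterion(samples_ms, pct=99):
    # Assumption: "perception of delay" is judged at the 99th percentile.
    return percentile(samples_ms, pct) <= MAX_PERCEIVED_DELAY_MS


# Made-up measurements: mostly fast, with one slow outlier.
samples_ms = [20] * 98 + [90, 250]
print(percentile(samples_ms, 99), meets_criterion(samples_ms))
```

“Fast” cannot fail a build, but a function like this can. Picking the percentile is itself a decision worth agreeing on up front, since a single outlier passes at the 99th percentile here and would fail at the 100th.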

Time-cap It

By this point, you’ve probably gathered enough data to understand how long a reasonable solution will take to implement. It’s valuable to put a time cap on it to generate a sense of urgency. This enables people to quickly prioritize what matters, and to understand how far down the rabbit hole they can go without putting the project at risk.

Record It

People move on. The engineering team you’re working with today will probably not be the engineering team maintaining this product in 24, 12, or even 6 months. People will come and go. The industry will change. It is best to record why you made decisions, or at least the major aspects of the decision against the fitness function. This enables new engineers to understand why decisions were made, and to make decisions with the same process and ideology.

In addition to the benefits around training new engineers, recording solutions is a way to prevent corruption. Because the success criteria and the business needs are more than likely generated by executives, or by people who are responsible for the long-term success of the company, it becomes hard to hide corruption in the shadows. Any incongruity between the business requirements and the fitness function, or between the fitness function and the solution, becomes nearly impossible to hide when everything is written down and available to the entire engineering organization.
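
What gets recorded can be very small. As a rough sketch, a record might carry the problem, the choice, how it fared against the fitness function, and what was rejected. The `DecisionRecord` shape, its field names, and the example values below are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DecisionRecord:
    """Hypothetical shape for a written decision; every field is illustrative."""
    title: str
    decided_on: date
    problem: str
    chosen: str
    fitness_notes: dict = field(default_factory=dict)  # criterion -> how the choice fared
    rejected: dict = field(default_factory=dict)       # option -> why it lost
    success_criteria: list = field(default_factory=list)

    def render(self):
        """Plain-text form that can live next to the code it explains."""
        lines = [
            f"{self.title} ({self.decided_on.isoformat()})",
            f"Problem: {self.problem}",
            f"Chosen: {self.chosen}",
        ]
        lines += [f"Fitness / {k}: {v}" for k, v in self.fitness_notes.items()]
        lines += [f"Rejected / {k}: {v}" for k, v in self.rejected.items()]
        lines += [f"Done when: {c}" for c in self.success_criteria]
        return "\n".join(lines)


# Invented example values.
record = DecisionRecord(
    title="Queue for credit-card charges",
    decided_on=date(2014, 7, 28),
    problem="Charge credit cards without losing or duplicating payments.",
    chosen="queue_b",
    fitness_notes={"reliability": "no lost enqueues during the prototype"},
    rejected={"queue_c": "lost data on enqueue"},
    success_criteria=["no more than 100 ms of perceived delay for the end user"],
)
print(record.render())
```

The rejected options are the part most worth keeping, because they are the first thing a new engineer will propose again.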

The Loop

Some portions of this process were taken from the OODA loop, an ancestor of modern agile principles. You’re going to be wrong sometimes, and that’s okay. Nearly everyone is wrong at some point. It’s valuable to lightly reassess the decisions made at the beginning of the project to ensure that the assumptions and the results of the research were valid. Additionally, on a long enough time scale, your decisions won’t be durable. User demand may have changed, your processor might have gone sour, or there may be new contenders in the market. As conditions change, it’s valuable to validate that your assumptions still stand, and if they have varied far enough to justify redoing the work, begin the process once again, like Sisyphus.

Written by Sargun Dhillon