Theory of Constraints 105: Drum-Buffer-Rope at Microsoft

A series of 5-minute posts on applying principles of flow to knowledge work

In the previous post, I explained Drum-Buffer-Rope (DBR), the original application of TOC to production environments like manufacturing. We’re now ready to take a closer look at a real-world example that brings together all the ideas we’ve covered in the series so far.

The story begins with a young program manager at Microsoft, Dragos Dumitriu, deciding to take on a challenge: turning around the worst-performing software development team in one of the company’s eight IT groups. By the time he was finished nine months later, the team was the best-performing in its business unit, with productivity up 155%, lead time reduced from five months to two weeks, and due-date performance improved from near zero to over 90%.

XIT Sustained Engineering was responsible for maintaining over 80 applications for internal use by Microsoft employees worldwide. This involved development and testing for small change requests (often bug fixes). The first quarter of 2005 saw the team’s worst productivity ever: the backlog exceeded capacity by a factor of five and was still growing. Needless to say, the four internal customer groups who needed their applications maintained were very unhappy.

XIT received about one change request per day, or 85 per quarter. But each of the team’s three developers had an average capacity of only 6.5 requests per quarter (about 20 in total), and actual throughput over the previous quarter was only 17 completed requests.

When a new change request arrived, it needed to be evaluated for a rough time estimate. The agreement between XIT and its internal customers specified that this estimate had to be performed within 48 hours, which meant that it had to be expedited as a top priority.

Here we begin to see the problem: producing an estimate required four hours each from a developer and a tester, meaning that each change request sucked one full day of productivity out of the system. One request per day, one day per request. They were standing still. Just producing estimates was taking up to 40% of total capacity.

It gets worse. Historical data showed that only about 50% of requests were actually completed by the team. The other 50% were either too big, too expensive, or too late. Only half of these estimates actually contributed in any way to throughput, which means that 15–20% of total throughput capacity was being used to evaluate work that would never even be attempted. It was pure waste.
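The arithmetic behind that 15–20% figure is easy to check. Here is a minimal sketch; the function name is my own, and the 30% lower bound on estimation effort is an assumption inferred from the 15–20% range (the post itself only states "up to 40%").

```python
def wasted_share(estimate_share: float, completion_rate: float) -> float:
    """Fraction of total capacity spent estimating requests that are never completed."""
    return estimate_share * (1.0 - completion_rate)


COMPLETION_RATE = 0.50  # from the post: about half of all requests were completed

# Assumed range for how much capacity estimation consumed.
# 0.40 is the post's "up to 40%"; 0.30 is inferred, not stated.
for estimate_share in (0.30, 0.40):
    waste = wasted_share(estimate_share, COMPLETION_RATE)
    print(f"estimates use {estimate_share:.0%} of capacity, {waste:.0%} of capacity is wasted")
```

The point of writing it out is that the waste scales with both numbers: halve the estimation effort or double the completion rate and the wasted share shrinks accordingly.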

Even worse, the estimates for requests that did get worked on went unused too: so much time had passed by the time work began that the evaluation had to be done again.

Which means that nearly all of the evaluation work was waste.

Because the backlog was so large, it had to be continually reprioritized. The monthly prioritization meetings got more and more stressful, as customers started fighting to have their priorities included. Trust broke down as they stopped believing their requests would ever be acted on without Severity 1 urgency.

Sound familiar?

Dumitriu came up with three interventions in collaboration with David Anderson, one of the first to apply TOC to software development. They may seem like common sense in retrospect, but the conceptual framework of TOC served as a belief-reinforcer and permission-giver for delicate changes to how the team worked and communicated with its customers.

The first intervention was to add a buffer, with eight slots in the queue before the development team, which was the bottleneck in the system. The goal was to stop the flow of new requests directly into the bottleneck, where the need for a timely estimate sucked capacity, pushing the schedule out and requiring constant reprioritization and rescheduling.

Management would only commit to a delivery date once a request was estimated, scoped, and assigned to a slot in the queue. This allowed project managers to set expectations and promise a realistic timeline. It choked the incoming flow to a volume that the team could actually consume. It reduced lead time per request, because developers were focused and single-tasking, instead of switching between priorities or fending off incoming requests. The monthly prioritization meetings became much less stressful.
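To make the buffer mechanics concrete, here is a minimal sketch of a fixed-size input queue in front of the bottleneck. It is an illustration under my own assumptions, not XIT's actual tooling: the class and method names are hypothetical, and refilling a freed slot in first-in-first-out order is a simplification (in the real process, customers decided what filled the slots).

```python
from collections import deque
from dataclasses import dataclass, field

BUFFER_SLOTS = 8  # from the post: eight slots in the queue before development


@dataclass
class InputBuffer:
    slots: int = BUFFER_SLOTS
    queue: deque = field(default_factory=deque)    # committed requests, in order
    waiting: deque = field(default_factory=deque)  # requests with no commitment yet

    def submit(self, request: str) -> bool:
        """Commit the request if a slot is free; otherwise hold it uncommitted."""
        if len(self.queue) < self.slots:
            self.queue.append(request)
            return True
        self.waiting.append(request)
        return False

    def pull(self):
        """A developer takes the next committed request, freeing one slot."""
        if not self.queue:
            return None
        request = self.queue.popleft()
        if self.waiting:
            # Assumption: refill the freed slot FIFO from the waiting list.
            self.queue.append(self.waiting.popleft())
        return request


buffer = InputBuffer()
results = [buffer.submit(f"CR-{i}") for i in range(10)]
first = buffer.pull()
print(results)               # which of the ten submissions got a committed slot
print(first, len(buffer.queue), list(buffer.waiting))
```

Notice that new work only enters the committed queue when the developers pull something out of it. The bottleneck sets the pace, and everything upstream waits without consuming any of its capacity.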

The second intervention was to stop providing upfront estimates. Every change was assumed to take the average of five days (with a couple of extra rules for edge cases). This was only possible after the buffer reduced and stabilized the average time per request.

The third intervention was to reallocate resources to optimize the bottleneck. One person was reassigned from testing to development, changing the developer-to-tester ratio from 3:3 to 4:2.

Once bottleneck capacity had been fully utilized and expanded using available resources, it was time for the final step: increasing bottleneck (and therefore system) capacity further by hiring more developers. Adding capacity before you’ve used what you already have would be like adding lanes to a freeway while a sofa blocks the express lane.

Total throughput increased from 17 requests per quarter to 56. The backlog was reduced from 80 requests to under 10, and the average cost per request fell from $7,500 to $2,900. The customers were happy again.
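For a sense of scale, here is a quick sketch that turns those before-and-after figures into percentage changes. The helper function is my own. Note that quarterly throughput is a raw count, so its change need not match the 155% productivity figure quoted at the top; I'm assuming that figure was calculated on a different basis, since the post doesn't say how it was derived.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures taken from the post.
throughput = percent_change(17, 56)        # completed requests per quarter
cost = percent_change(7500, 2900)          # average cost per request, in dollars

print(f"throughput: {throughput:+.0f}%")
print(f"cost per request: {cost:+.0f}%")
```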

It is often assumed that, if TOC applies at all to modern knowledge work, it must require the most advanced applications. Using a simple approach like Drum-Buffer-Rope in this case showed that much of what constrains the productivity of software engineers (and by extension, other knowledge workers) is not related to the details of how they perform their work, but to the management, planning, scheduling, and queueing of work.

In other words, engineering may be the bottleneck of the whole organization, but management is the bottleneck to improving engineering.

