Focusing on Problem Aging

I work at a software company that develops products for unified communications and contact centers. One of our focus areas is quality. Not just quality, but built-in quality. Quality is something we don’t compromise on. Teams are encouraged to fix problems as soon as they detect them and to push for a zero-defects approach.

Before we drill down into why it is important to fix problems as soon as we can and not let them age, I want to say that this was not a trivial shift for teams that were used to working in a traditional software development approach (AKA waterfall). Most of our teams used to handle defects and problems in the “hardening cycle” that always came at the end of the release. Some teams also struggled to give proper attention to defects coming from customers, as there was pressure to deliver more features into the products.

As soon as we shifted to an agile software delivery approach, we worked hard on the quality aspect. We worked with the teams to deliver work-items that had no defects, within every sprint. This required building teams that contained both testers and developers. Moreover, everyone on the team owned the quality and contributed to the testing effort (preferably automated testing and TDD/BDD). For defects that escaped and came from outside the team, we made sure that the team had enough capacity to tackle them throughout the sprint.

Quality was important to us for two reasons. The first was reducing cost by eliminating waste: the more time that elapses between the moment a defect is created and the moment it is fixed, the greater the waste. The waste grows exponentially with the time elapsed. This means we need to find defects as soon as we can, and fix them as soon as we find them.

The second reason was that the longer a customer waits for a defect to be resolved, the less satisfied that customer is. Any defect found by our customers should get proper attention. We don’t want to let those defects age, and we don’t want to let our customers wait too long. The big question was how to manage that: how do we prevent problems from getting old in our system?

The first approach we took was measuring the size of our defect queue. We built defect trend graphs that showed how many defects were still open on any given day. Our target was to reduce the total number of defects. We started with a couple of hundred defects. We reduced our feature delivery velocity to free up enough capacity for handling the defects. After 6–12 months we had reduced the defect count to just a few dozen. At this point, we decided to increase our feature velocity while keeping the defect queue at the same level (not letting the number of defects increase). We figured that having a few dozen defects at any given time was good enough, since we fix some while new ones are being created.
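To make the idea of a defect trend graph concrete, here is a minimal sketch of how the number of open defects per day could be computed. The function name, the data layout (pairs of opened and closed dates), and the sample dates are all assumptions made for illustration; they are not taken from our actual tooling.

```python
from datetime import date, timedelta


def open_defects_per_day(defects, start, end):
    """Count how many defects are open on each day from start to end (inclusive).

    defects: list of (opened, closed) date pairs; closed is None if still open.
    Assumption: a defect closed on a given day no longer counts as open that day.
    """
    trend = {}
    day = start
    while day <= end:
        trend[day] = sum(
            1
            for opened, closed in defects
            if opened <= day and (closed is None or closed > day)
        )
        day += timedelta(days=1)
    return trend


# Hypothetical data, for illustration only.
defects = [
    (date(2024, 1, 1), date(2024, 1, 3)),  # opened Jan 1, fixed Jan 3
    (date(2024, 1, 2), None),              # still open
    (date(2024, 1, 3), None),              # still open
]

trend = open_defects_per_day(defects, date(2024, 1, 1), date(2024, 1, 4))
for day, count in trend.items():
    print(day.isoformat(), count)
```

Plotting these daily counts gives the trend line. A flat line means the queue is stable, with new defects arriving about as fast as old ones are fixed.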

As it turned out, maintaining the same size of the defect queue wasn’t good enough. Our customer support folks told us that the age of the defects plays a big role in customer satisfaction. They told us they expect two main things from our engineering department: (1) an initial response time of up to 3 days for any customer defect, and (2) a fix as soon as we can.

For that to happen, we changed a few things in the way we operate. First, we worked with our teams to achieve a response time of up to 3 days for any customer defect. Since context switching is a killer for productivity, we set a ground rule: every time a team member finishes a work-item in the sprint, that team member looks at the defects waiting for a response and picks one up before moving on to the next work-item in the sprint. We also worked with the teams to slice work-items small enough to be time-boxed to a few days each. This ensured that we don’t context switch too often and that we still provide a fast initial response to defects.
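The ground rule can be sketched in a few lines of code. This is only an illustration under stated assumptions: the record fields (`id`, `reported`, `responded`), the function names, and the choice to count calendar days and to pick the oldest waiting defect first are mine, not a description of a real tracker.

```python
from datetime import date

RESPONSE_LIMIT_DAYS = 3  # the initial response target asked for by customer support


def next_defect_to_answer(defects):
    """Return the oldest defect still waiting for an initial response, or None."""
    waiting = [d for d in defects if d["responded"] is None]
    if not waiting:
        return None
    return min(waiting, key=lambda d: d["reported"])


def overdue_ids(defects, today):
    """Return ids of defects that have waited longer than the limit (calendar days)."""
    return [
        d["id"]
        for d in defects
        if d["responded"] is None
        and (today - d["reported"]).days > RESPONSE_LIMIT_DAYS
    ]


# Hypothetical data, for illustration only.
today = date(2024, 3, 10)
defects = [
    {"id": "A", "reported": date(2024, 3, 5), "responded": None},
    {"id": "B", "reported": date(2024, 3, 8), "responded": None},
    {"id": "C", "reported": date(2024, 3, 1), "responded": date(2024, 3, 2)},
]

pick = next_defect_to_answer(defects)
print("Pick up next:", pick["id"])
print("Already past the limit:", overdue_ids(defects, today))
```

A team member who has just finished a work-item would pick up defect A here, since it has waited the longest and is already past the 3-day target.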

Second, we had to track defect aging time, to make sure that neither the total age nor the average age of our open defects is increasing. This keeps our focus not only on the total number of defects, but also on how old they are. We recently started to monitor that, and it triggered a change in our behavior. Instead of mostly looking at the fresh defects, we started to watch the old defects that had been left lingering. We realized that each of those defects has an unsatisfied customer attached to it, one who may already have given up on our product.
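As a rough illustration of the two aging numbers, the sketch below computes the total age and the average age of the open defects on a given day. The function name and the sample dates are hypothetical, and age is assumed to be counted in calendar days since the defect was opened.

```python
from datetime import date


def aging_metrics(opened_dates, today):
    """Return (total_age_days, average_age_days) for the currently open defects."""
    ages = [(today - opened).days for opened in opened_dates]
    total = sum(ages)
    average = total / len(ages) if ages else 0.0
    return total, average


# Hypothetical data, for illustration only: two queues of the same size.
today = date(2024, 3, 10)
fresh_queue = [date(2024, 3, 8), date(2024, 3, 7), date(2024, 3, 6)]
stale_queue = [date(2024, 3, 8), date(2024, 3, 2), date(2024, 1, 20)]

fresh_total, fresh_avg = aging_metrics(fresh_queue, today)
stale_total, stale_avg = aging_metrics(stale_queue, today)
print("fresh queue:", fresh_total, fresh_avg)
print("stale queue:", stale_total, stale_avg)
```

Both queues hold three defects, so a queue-size graph cannot tell them apart, yet the second one carries a 50-day-old defect that dominates both the total and the average age. Tracking both numbers over time is what surfaces such lingering defects.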

I like the behavior change that was triggered by monitoring defect age, and I intend to keep using it. If I were to do it all over again, I would probably start watching it earlier in the process, but not before making sure the defect database is clean. By clean, I mean that it contains no irrelevant defects (obsolete defects, defects that were already fixed, etc.). So the strategy I would recommend is: clean your defect database, then monitor both the defect queue size AND the defect age (total age + average age).

For more information about this topic, please refer to Management 3.0 Problem Time.


