Durable Preventive Actions: How Appian Systematizes Continuous Improvement
Quality is still important. Modern software development culture prescribes agile development processes, frequent delivery, and fast customer feedback. So, isn’t it good enough to deliver something that generally works, then fix what your customers complain about the most? Probably not. Customers of your software, especially enterprise software customers like Appian’s, did not buy your software to be part of a quality experiment. They have a problem to solve, a business to run, and probably a product ROI to defend.
As a software engineering organization, you literally cannot afford to let your guard down on delivered quality. You need to move fast. But… as Mark Zuckerberg said in 2014 when changing Facebook’s “move fast and break things” motto, speed demands quality:
“What [Facebook] realized over time is that [move fast and break things] wasn’t helping us to move faster, because we had to slow down to fix these bugs and it wasn’t improving our speed.”
For the sake of your own speed and competitiveness, you need to continually invest in improving your ability to deliver high quality software. It is good for your customers. It is good for you. Unfortunately there is no standard recipe for how to do this in a modern software company. Read on to see an example of how we keep getting better at creating a high quality product.
Continuous Improvement — How do you do that?
Success is a lousy teacher, as Bill Gates said in his 1995 book The Road Ahead, so we make sure we learn from our non-successes. Every bug that leaks to our customers contributes to the continuous improvement of our Engineering department. Each goes through a systematic Root Cause Analysis (RCA) that yields specific activities to keep that type of bug from recurring. We call the result of these activities Durable Preventive Actions (DPA).
The RCA process is designed to maximize the value of time invested by avoiding 3 common problems:
- Software bugs are viewed too narrowly as a fault of testing
- Identified preventive actions are too generic and broad to cause change
- The RCA effort starts with good intentions, but energy for it flames out
Just Add a Test, Right?
Yes, but there is more. Testing is almost always the reason a software bug was not identified, but it is almost never the root cause of a bug. Why? Testing does not change the code. The true fault is typically a requirements, design, or coding mistake upstream of testing. A suite of tests created as the result of RCA will only serve to identify bug recurrence more quickly. It will probably prevent that specific instance from leaking to users, but it probably does not address the systemic issue that truly caused the problem. The ultimate goal of the RCA program is to get at the systemic issue and keep the mistake from being made in the first place.
So, how to look past the tests? In order to encourage broader consideration of causal factors, the Quality Engineer from the agile team (aka “squad” at Appian) most responsible for introducing the issue leads a cross-functional discussion structured around the 4 main areas of the software lifecycle: Requirements, Architecture/Design, Coding, Testing. Looking at issues from each of these viewpoints typically yields a comprehensive understanding of what actually went wrong. Chances are your SDLC does not look exactly like Appian’s, but you probably have process, formal or not, around these areas.
Once the team has identified the root cause(s), the next step is to identify how to prevent future recurrence.
Preventive Actions Need to be Durable
“Be more thorough in code review next time”
“Make sure design better supports performance needs”
“Be more complete in requirements next time”
“Test more completely”
The above, all too typical, Preventive Action statements are weak. They have at least 3 things in common: i) they could all result from the RCA of almost any issue, ii) they are equally not actionable, and iii) they provide no way to prove they will prevent future issues. Why? They are too aspirational. We were already likely trying to do these when the issue slipped through. A Preventive Action needs to result in something that will reliably prevent this specific class of issue from recurring. It needs to answer questions in the context of the identified issue:
- How to be more thorough in code review?
- What does designing for performance mean?
- What is a “complete” requirement?
- What does “completely” test mean?
In other words, effective Preventive Actions must be durable. They must change the process, tool, or training that caused the issue so that future issues of its kind are prevented regardless of who would have introduced them, how, or when.
“Add static code analysis (lint rule) to the build pipeline to catch use of duplicate keys”
“Correct developer documentation on the meaning of log levels”
“In design review checklist, ensure that system performance in response to data growth is projected”
“Update testing checklist to ensure new feature test cases cover documented feature behavior”
How are these Durable Preventive Actions different than the earlier ones? They are specific changes that address a particular root cause. Since they are Software Development Lifecycle (SDLC) process and artifact changes, their impact is guaranteed to persist well past the lifetime of the RCA activity. They are immediately part of the development organization’s institutional memory and have no remaining dependence on the people involved in the initial RCA. They will persist and protect for the lifetime of the organization.
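To make the lint-rule example concrete, here is a minimal sketch of what such a check could look like. It is not Appian's actual rule: it assumes the "duplicate keys" are repeated keys in a Python dict literal, and the function name `find_duplicate_dict_keys` is hypothetical. It uses only the standard library `ast` module.

```python
import ast


def find_duplicate_dict_keys(source, filename="<string>"):
    """Return (line, key) pairs for constant keys repeated in one dict literal.

    Hypothetical illustration of a lint rule; assumes the issue class is
    "the same literal key written twice in a dict literal".
    """
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Dict):
            continue
        seen = set()
        for key in node.keys:
            # `key` is None for `**other` unpacking; computed keys are
            # skipped because their values are unknown before runtime.
            if not isinstance(key, ast.Constant):
                continue
            if key.value in seen:
                findings.append((key.lineno, key.value))
            seen.add(key.value)
    # ast.walk is breadth-first, so sort findings by line for stable output.
    findings.sort(key=lambda finding: finding[0])
    return findings


example = '''settings = {
    "log_level": "INFO",
    "timeout": 30,
    "log_level": "DEBUG",
}
'''
duplicates = find_duplicate_dict_keys(example)
```

A build step could run a check like this over every changed file and fail when the returned list is non-empty. That is what makes the action durable: the pipeline enforces it without relying on anyone remembering the original bug.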
Appian’s learning culture provides a great platform for this: there is a body of knowledge readily available to engineers in which to place DPA results. The better the training material and process automation you have in place, the easier it is to make durable changes. In fact, one of the first things you may find when making Preventive Actions durable is that you want to improve your general learning environment. When updating process or training docs for a DPA, be sure to reference the specific issue(s) that instigated the update. This provides invaluable context for engineers to understand why the DPA matters.
Keeping the RCA process thriving
I’ve told you how we do the analysis and why the D in DPA is important. Now, how do you ensure that the effort you put into your RCA process does not have a eulogy like the following one I came across?
“Sadly we fell behind on doing those after a rather sizable rash of issues and then fell off altogether. … This is something I intend to reinstitute very soon … once we can work the kinks out of the process that allowed this to fall by the wayside.”
At Appian, we keep the process vibrant by assigning clear accountability for the RCA process and clear visibility into its progress. The squad that fixes the bug is responsible for the RCA/DPA and the Quality Engineer on the squad is accountable for ensuring it happens. Visibility is provided through a Root Cause Analysis workflow built in our Application Lifecycle Management (ALM) tool. The process flow below shows the states of the analysis:
Most of these process states are used to manage the creation and then the review of the RCA findings. Many DPAs are completed by the time the last review phase completes. When there are incomplete DPAs, the RCA ticket is moved into “DPA In Progress”. It remains there until the related DPAs are complete. At that point, the RCA is closed as “Done”. This state model allows simple visibility over the entire process using the same tool we use to manage other daily engineering processes. It ensures individual RCAs are moving through the process at an acceptable rate and that, most importantly, they are not closed until all associated DPAs are complete. After all, the tangible value of the RCA process is the completed DPA.
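The closing rule described above can be sketched as a small state model. This is an illustration under stated assumptions, not the workflow configured in Appian's ALM tool: only the “DPA In Progress” and “Done” states come from the text, while the `ANALYSIS` and `IN_REVIEW` states and all class and method names are hypothetical stand-ins for the creation and review phases.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RcaState(Enum):
    ANALYSIS = "Analysis"                # hypothetical name for the creation phase
    IN_REVIEW = "In Review"              # hypothetical name for the review phase
    DPA_IN_PROGRESS = "DPA In Progress"  # state named in the article
    DONE = "Done"                        # state named in the article


@dataclass
class Dpa:
    description: str
    complete: bool = False


@dataclass
class RcaTicket:
    bug_id: str
    dpas: List[Dpa] = field(default_factory=list)
    state: RcaState = RcaState.ANALYSIS

    def _all_dpas_complete(self):
        # all() of an empty list is True, so an RCA with no DPAs can close.
        return all(dpa.complete for dpa in self.dpas)

    def submit_for_review(self):
        if self.state is not RcaState.ANALYSIS:
            raise ValueError("only a ticket in analysis can be submitted")
        self.state = RcaState.IN_REVIEW

    def finish_review(self):
        """Close if every DPA is complete, otherwise wait in DPA In Progress."""
        if self.state is not RcaState.IN_REVIEW:
            raise ValueError("ticket is not in review")
        if self._all_dpas_complete():
            self.state = RcaState.DONE
        else:
            self.state = RcaState.DPA_IN_PROGRESS

    def close(self):
        """Move from DPA In Progress to Done; refuse while any DPA is open."""
        if self.state is not RcaState.DPA_IN_PROGRESS:
            raise ValueError("ticket is not waiting on DPAs")
        if not self._all_dpas_complete():
            raise ValueError("cannot close an RCA with incomplete DPAs")
        self.state = RcaState.DONE
```

The design point is that the ticket itself rejects an early close, so the guarantee that no RCA is “Done” before its DPAs are complete does not depend on reviewer discipline.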
Final RCA/DPA Words
Whether you have an existing RCA process you are looking to tweak or are deciding to put one in place for the first time, please also consider these lessons we have learned:
Don’t rush it. The RCA process is hard. Working back through past engineering decisions to determine how they could have been made better is technically challenging and time-consuming. While we describe strong governance over the process, we do not try to rush it. For the activity to be valuable, sufficient thought needs to be put in and the right conversations need to be had. If you find yourself rushing RCAs along, take a look at what you are getting to make sure it is actually valuable.
Don’t force it. Some problems are one-time human mistakes, results of incorrect judgement calls, or defects in a system built well before current quality processes were in place. There may be little to no value in applying RCAs to these issues, so only follow the RCA process for issues that can truly be learned from.
Expect many small steps. Software Continuous Improvement is typically an ongoing iteration over your current process. It is uncommon to find one simple change that avoids an entire class of problems. Success is often continually finding tweaks to your infrastructure, process, and/or training to prevent issues one step at a time.