Learning From Disaster: Crashing Planes to Productive Teams
What Crew Resource Management Can Teach Us About Building Software
Before I became a software engineer, I was an Aircrewman in the US Navy. During my time as a Naval Aircrewman, I learned many important lessons that I’ve been able to apply to my work in software development. Whether we’re talking about aircrewmen or software developers, teams can vary widely in style and approach.
In both roles, the teams I’ve been part of relied on their members not only to own their function (Pilot, Radar Operator, Flight Engineer vs. Mobile, Web, Backend Developer), but also to communicate information effectively and speak up about important issues regardless of station.
While these are important attributes of any well-functioning team, the Navy trains for these teamwork skills specifically, and I’ve carried them over into my career in tech. Known as Crew Resource Management (CRM), this training was developed by NASA Ames in the 1970s to address the role of human error in accidents involving perfectly functional aircraft.
The Role of Human Error
So, what led to the creation of CRM?
One of the major events inspiring the development of CRM training was the United Airlines Flight 173 crash of 1978.
According to the National Transportation Safety Board (NTSB) (Footnote 1), the crew of Flight 173 was circling the Portland airport, troubleshooting a problem with the landing gear prior to landing. In the cockpit of the DC-8 sat the Pilot, Co-Pilot, and Flight Engineer, diligently working through the emergency gear-up landing checklist. According to the cockpit voice recorder, the Flight Engineer mentioned the dwindling fuel supply on multiple occasions. Unfortunately, the Pilot failed to comprehend the gravity of the fuel situation and made a turn away from the airport with less than five minutes of fuel remaining. Flight 173 crashed six miles from the airport after exhausting its fuel supply.
More about this accident can be read in the NTSB accident report (Footnote 1).
CRM and Shared Mental Models
CRM was designed to encourage better communication between members of an aircrew, fostering a shared mental model and reducing occurrences like this one. While CRM training takes different forms depending on the military branch or airline that develops the training curriculum, it is now a required part of any military or civil aircrew’s training.
During my time as an Air Warfare Systems Operator in US Navy Patrol Squadron 47, my CRM training consisted of seven factors in ranked order:
- Situational Awareness
- Decision Making
- Assertiveness
- Mission Analysis
- Communication
- Leadership
- Adaptability/Flexibility
CRM and Software Development
Since the 1970s, CRM has been a staple of aviation training for both pilots and crews. Today, the principles of CRM training are starting to find applications outside the cockpit, in domains where human error can impact expected outcomes. For example, some hospitals have adopted CRM training for the operating room to reduce human error among medical practitioners (Footnote 2).
But is CRM applicable to something as benign as software development? It’s true that our failure modes are nowhere near as critical or life threatening. However, if you’re reading this and you’re like me, you probably don’t think anything about software development is benign. Software projects may take longer to fail, and have markedly different impacts when they do, but according to a 2008 study by IAG Consulting, 68% of software projects have an “improbable chance of success.” (Footnote 3) What if improved shared mental models could impact that number in a positive way, like they have for flights and surgical operations?
While crashing a software development project might not mean the difference between life and death, there is still much that can be learned from the principles of teamwork designed to reduce errors and foster more meaningful communication found in CRM. Let’s go through the seven aspects one by one.
Situational Awareness
The Center for Naval Analyses, sponsored by DARPA, wrote a 70+ page paper on Defining and Measuring Shared Situational Awareness (Footnote 4). I won’t attempt to dive that deep here. One quote from the paper defines situational awareness as “A clear and accurate, common, relevant picture of the battlespace.” Software engineers work in an exceedingly complicated battlespace every day. As a software engineer troubleshooting issues with a development team, it’s important to establish a shared mental model of the problem prior to beginning work.
When troubleshooting an issue in a software application, it’s important to communicate certain baseline facts about the situation early on. Which environment are we troubleshooting in? Do we have a snapshot of the code causing the problem that mirrors the last release? Do we have logs from the front end where the crash occurred, as well as any associated backend logs of the session that might help build a better mental model of the situation leading to the crash?
Your team will probably have its own specific needs for building a shared mental model, so take some time to think through what those might be prior to your next unexpected war room bug bash.
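One way to make that baseline explicit is to write it down as a small checklist the whole team can see. The following is a minimal Python sketch rather than a prescribed tool: the `IncidentContext` class, its field names, and the sample values are all hypothetical, chosen only to mirror the questions above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentContext:
    """Baseline facts the team agrees on before troubleshooting starts (hypothetical fields)."""

    environment: Optional[str] = None       # which environment we are troubleshooting in
    release_commit: Optional[str] = None    # snapshot of the code that mirrors the last release
    frontend_log_ref: Optional[str] = None  # where the front-end crash logs live
    backend_log_ref: Optional[str] = None   # where the matching backend session logs live

    def missing(self) -> list[str]:
        """Return the names of the baseline facts nobody has filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Sample values are placeholders, not real identifiers.
context = IncidentContext(environment="staging", frontend_log_ref="crash-report-123")
gaps = context.missing()
if gaps:
    print("Before we start, we still need:", ", ".join(gaps))
```

Anything still listed in `gaps` is a part of the picture the team does not yet share, which makes it a good first item for the war room agenda.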
The US Coast Guard provides some helpful tips for identifying whether you or your team are experiencing a loss of situational awareness (Footnote 5) and might need to take corrective action. If you find yourself operating off gut feelings instead of proper procedures, or fixating on an issue that may not lead to resolution of the greater problem, it might be a good time to circle back with your team and make sure everyone is on the same page. When you’re troubleshooting an issue, focusing too hard on delivering a fix might cause you to skip certain procedures like running unit and integration tests. But without running these checks, you might end up causing a larger problem down the road.
Note: if your CI/CD pipeline is set up correctly, you won’t be able to skip these steps anyway. Jenkins is your crew mate too :)
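To illustrate the idea behind that note, here is a minimal sketch of a gate that refuses to deploy unless every check passes. It is not how Jenkins itself is configured: `deploy_if_safe` and the stand-in check functions are hypothetical names, and in a real pipeline each check would wrap your actual unit and integration test runs.

```python
from typing import Callable, Sequence


def deploy_if_safe(checks: Sequence[Callable[[], bool]], deploy: Callable[[], None]) -> bool:
    """Run every check in order and deploy only if all of them pass."""
    for check in checks:
        if not check():
            return False  # a failed check blocks the deploy, however urgent the fix feels
    deploy()
    return True


# Hypothetical stand-ins for real unit and integration test runs.
def unit_tests_pass() -> bool:
    return True


def integration_tests_fail() -> bool:
    return False


deployed: list[str] = []
ok = deploy_if_safe([unit_tests_pass], lambda: deployed.append("release"))
blocked = deploy_if_safe(
    [unit_tests_pass, integration_tests_fail],
    lambda: deployed.append("hotfix"),
)
print(ok, blocked, deployed)  # prints: True False ['release']
```

The point of the design is that the person under pressure never gets to decide whether the checks run; the gate decides for them.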
Decision Making
Every day, from the moment you first start up your computer, you’re inundated with all kinds of information. However, without complete or perfect information, you can never make a perfect decision. What you can do is take the information you have and apply logic and sound judgement to make the best decision possible.
While some decisions are easy, others may require a framework for working through the problem. One such framework, known as the DECIDE model, is described in the training manual Aeronautical Decision Making for Helicopter Pilots (Footnote 6):
- Detect change
- Estimate significance of change
- Choose outcome
- Identify options
- Do best option
- Evaluate results
Take the example of a situation in which a third-party API your app relies on suddenly goes down. Obviously, to make any decision you must be able to detect that the change occurred, so keep in mind that the information needed to make decisions must be not only accurate and plentiful but also timely. Next, you must determine the significance of this change. Is this a minor inconvenience for your users or is this app-breaking behavior? The answer to this question might mean the difference between jumping on a call with the third-party API’s CTO or simply ensuring the error state is handled correctly.
A number of questions can help you determine the significance of a problem or change. From the Coast Guard Training Manual on Decision Making (Footnote 7):
- Who is affected; who is not?
- What situation is affected; what related things are not affected?
- Where is the problem?
- When did the problem occur?
- Do areas affected by the problem affect other areas? To what extent?
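The answers to these questions can be captured in a simple structure so the team estimates significance the same way each time. This is a rough sketch under stated assumptions: the `ImpactReport` fields, the 50% threshold, and the two labels are hypothetical and are not taken from the Coast Guard manual.

```python
from dataclasses import dataclass, field


@dataclass
class ImpactReport:
    """Answers to the triage questions for one problem (hypothetical structure)."""

    affected_users: int  # who is affected...
    total_users: int     # ...out of how many users overall
    affected_features: list[str] = field(default_factory=list)    # where the problem is
    downstream_features: list[str] = field(default_factory=list)  # other areas it spills into

    def significance(self, breaking_share: float = 0.5) -> str:
        """Label the change 'app-breaking' or 'minor'; the threshold is an assumption."""
        share = self.affected_users / self.total_users if self.total_users else 0.0
        if share >= breaking_share or self.downstream_features:
            return "app-breaking"
        return "minor"


cosmetic = ImpactReport(affected_users=20, total_users=1000, affected_features=["avatars"])
outage = ImpactReport(
    affected_users=50,
    total_users=1000,
    affected_features=["payments"],
    downstream_features=["checkout"],
)
print(cosmetic.significance(), outage.significance())  # prints: minor app-breaking
```

Your team’s real thresholds will differ; the value is in agreeing on them before the outage rather than during it.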
Once the significance of the change is known, you must identify what the outcome should be. For example: “The user should not even realize the API is down.” In this situation, the various stakeholders would of course need to be involved in determining what the desired outcome should be. Then it’s on to engineering to decide what the options are for achieving this desired state. The options might include:
- Don’t let the user see the broken feature, just toggle it off.
- Use an alternative API while this vendor is down.
Once the options are identified, you must pick the best one and execute the steps required to implement it. And finally, the job’s not over when the change is pushed. Make sure to verify that the change meets the expectations of the outcome chosen by your team.
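The two options above, plus the final evaluation step, might look something like this in code. It is a simplified sketch: `fetch_with_fallback`, `render_feature`, and the two vendor functions are hypothetical stand-ins, and a real app would call actual HTTP clients and use a proper feature-flag system.

```python
from typing import Callable, Optional


def fetch_with_fallback(
    primary: Callable[[], str],
    alternative: Optional[Callable[[], str]] = None,
) -> Optional[str]:
    """Try the primary vendor, then the alternative; None means no vendor answered."""
    for source in (primary, alternative):
        if source is None:
            continue
        try:
            return source()
        except ConnectionError:
            continue  # this vendor is down, so move on to the next option
    return None


def render_feature(data: Optional[str]) -> str:
    """Toggle the feature off (render nothing) when there is no data to show."""
    return f"<widget>{data}</widget>" if data is not None else ""


# Hypothetical vendors: the primary simulates an outage, the alternative responds.
def primary_vendor() -> str:
    raise ConnectionError("vendor A is down")


def alternative_vendor() -> str:
    return "rates from vendor B"


with_backup = render_feature(fetch_with_fallback(primary_vendor, alternative_vendor))
without_backup = render_feature(fetch_with_fallback(primary_vendor))

# Evaluate results: in both cases the user never sees an error state.
print(repr(with_backup), repr(without_backup))
```

Either way, the chosen outcome (“the user should not even realize the API is down”) is something you can check for explicitly after the change ships.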
Assertiveness
Rank structure is a culturally important part of the US Armed Forces. Questioning a higher-up is generally not practiced in most military environments. The deck of an aircraft is one of the unique situations where the good of the mission and the safety of the crew outweigh some standard military protocols. For example, it’s perfectly acceptable for the lowest ranking member of the crew to bring up an unsafe condition on the aircraft to the mission commander.
Assertiveness is an asset aboard a military aircraft because people who act in assertive ways do so because they have a stake in the outcome and success of a mission. The same is true of software developers who exhibit assertive behavior.
Being assertive on a software development team means you care about the product and are willing to risk putting your ideas forward and having them proven wrong. This is an important aspect of assertiveness. Assertiveness is not aggressive behavior and it does not mean asserting your own ideas above all others because they are your ideas. It means having a “willingness to actively participate, state, and maintain a position until convinced by the facts that other options are better.”(Footnote 8)
It’s up to all members of a software development team to actively participate with their best ideas during the software development process, calling out issues as they encounter them. It’s also up to the leaders of that team to harness that energy, making sure team members feel empowered and confident that the team listens to their ideas and heeds their warnings.
Mission Analysis
Mission analysis as a CRM skill is unique to the military branches and is not addressed in the civilian FAA CRM training guidelines. I include it here because I think the dynamic environment of software development benefits from its inclusion as an important factor in risk reduction. Just as it would be disastrous to take on any military mission without a sound plan, it would be equally disastrous to take on a new software feature or app without one.
How much time a team spends on emergencies often comes back to the initial mission analysis. Mission analysis should occur before an emergency does, so it’s a good thing to keep in mind throughout the development process. Just as weather changes can cause complications for a military mission, the factors that affect a software development project can change significantly day to day, or even minute to minute. These kinds of issues should be identified early in the planning phases of a project, and systems should be put in place to monitor changes as they occur and alert the rest of the team as needed.
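As a small illustration of “monitor changes and alert the team,” the sketch below compares the assumptions a plan was built on against their current values and reports anything that has drifted. The function names, the factor names, and the use of `print` as the notifier are all hypothetical; in practice the notifier might post to your team’s chat channel.

```python
from typing import Callable, Optional

Change = tuple[Optional[str], Optional[str]]


def changed_factors(planned: dict[str, str], current: dict[str, str]) -> dict[str, Change]:
    """Return every planning factor whose current value differs from the plan."""
    return {
        name: (planned.get(name), current.get(name))
        for name in sorted(planned.keys() | current.keys())
        if planned.get(name) != current.get(name)
    }


def alert_team(changes: dict[str, Change], notify: Callable[[str], None] = print) -> None:
    """Send one message per changed factor through the supplied notifier."""
    for name, (before, after) in changes.items():
        notify(f"Plan assumption changed: {name}: {before!r} -> {after!r}")


# Hypothetical factors recorded during planning and observed today.
planned = {"vendor_api_version": "v2", "sprint_length_days": "14"}
current = {"vendor_api_version": "v3", "sprint_length_days": "14"}

changes = changed_factors(planned, current)
alert_team(changes)
```

Recording the planning assumptions in one place is the real work; once they exist, noticing that one has changed becomes a cheap, repeatable check.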
While it should be expected that a situation might change, that’s no excuse for not having a plan at all. The better a plan is to begin with, the better prepared your team will be to accept and adapt to changes as they emerge.
Sprint planning and grooming are your major mission analysis opportunities in software development. Here are some effective operational planning tips (Footnote 9) you can try applying in those meetings. Thinking through these bullet points for every story in your backlog may reveal interesting results.
- Define tasks based on mission requirements.
Example: If one of your stories is about helping the user more easily find the help button on your app, make sure to create tasks to address that story based on specific requirements. “How should we make the help button easier to find? Should we change the color or make the button larger?”
- Question data or ideas as they relate to mission accomplishment.
Example: If our goal is to make sure the user gets help when they need it, maybe we could identify when they are having a problem and prompt them to see if they need help directly instead?
- Discuss long and short-term plans for the mission.
Example: Maybe in the short term, we can make the button larger while also creating a long-term plan to better analyze user behavior so we can actively determine when they need help.
- Identify the impact of potential hazards and unplanned events on the mission.
Example: Increasing the button size might cause issues in different orientations or dimensions, so let’s be sure we test all these scenarios.
Communication
I used to think aviation radios were the worst way to communicate in the modern age… then I sent a group email. It’s surprising how easily context and tone can be lost in written communication, which makes the art of communicating over email or Slack even more important.
Clear communication is something every team must work at in order to preserve resources and time. Remember that even face to face communication doesn’t guarantee better results. Have you ever told someone something you thought was extremely important, only to have them say they didn’t remember you telling them at all? During stressful situations, it’s common for people to acknowledge what you told them with an “okay” — whether they really registered your words or not.
You can improve your own communication by not just reflexively acknowledging what someone has told you, but repeating their words back to them or even rephrasing them to ensure you understand the meaning behind their words.
Remember — communication is the skill that all other CRM skills hinge on. If you’re not communicating enough, or not communicating effectively, then none of the other skills matter.
Leadership
On an aircrew, leadership is not the sole responsibility of the pilot or mission commander. Each member of the aircrew has an area of expertise that makes their position on the crew special and makes them exceptionally suited to identifying problems in that field. In the same way, the members of a software development team are each uniquely suited to address different issues depending on their areas of expertise.
Part of leadership is never assuming that you can’t lead the team in one way or another just because you’re not in a designated leadership position. While it’s important to realize that the designated leader has the final word on decisions, that doesn’t mean your voice can’t be one of those guiding the team to the final decision… in fact, it should be!
It’s no secret that software engineers love to solve problems. Sometimes, unfortunately, we love solving problems so much we create problems while attempting to solve others or we focus on fun problems we want to solve rather than the important problems that need to be addressed:
“I’m going to write Unit Tests… but first I have to find the perfect framework”
“None of these perfectly fit my use case… I need to write my own framework!”
A junior team member can be a great counterbalance to potentially inefficient tasks simply by asking clarifying questions about the development tasks being introduced. Asking “why” is always a useful exercise: if the “why” isn’t clear to you, it may not be clear to others either. Then, if the “why” still doesn’t make sense, you may have an opportunity to lead your team down the right path.
Adaptability/Flexibility
The ability to be flexible in a tough situation and adapt to changes that occur mid-mission is a critical skill for an aircrew to learn.
Software developers can be asked to be flexible in surprising and significant ways too. Being willing to adapt to those changes is what makes a good development team a great development team. Being asked to adapt is often an acknowledgment of your ability to adapt to a situation. It’s also your opportunity to excel as a result. Teams can show their flexibility in several ways, from implementing a new solution to an existing problem, to taking over a whole new project, to supporting a new language or development framework.
In these situations, teams have two options — either adapt and survive, or refuse and break under the pressure of new changes happening around them.
The principles of Crew Resource Management have contributed to the reduction of human error by increasing team awareness and promoting team cohesion. The same principles that keep planes in the air and surgical teams alert and communicative can be applied to software development teams.
None of these seven CRM skills are ground-breaking-innovative-breakthroughs™. I don’t believe that telling a flight crew, surgical team, or development team to “focus on communication” will make anyone a better communicator. The value of CRM doesn’t come from having one team member focus on a single skill or two, but from wholesale team buy-in to the framework of values put forth in CRM. If every team member understands that they’re expected to show leadership in their own way and to be assertive, then communication on the team will become more productive. Communication fosters greater situational awareness for the whole team, which improves the ability to plan and analyze future projects. A team that is planning ahead will be much more capable of adapting to changes and making better decisions.
While CRM was developed to guide aircrews through a journey safely, I hope that keeping these principles in your thoughts the next time you work with your development team will help you problem solve more effectively!
- https://www.ntsb.gov/investigations/AccidentReports/Reports/AAR7907.pdf
DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2017 Capital One.