Givelify’s key to a successful Root Cause Analysis

R.K. Hari Krishna
Givelify Engineering
7 min read · Feb 15, 2023

I hadn’t arrived at the office yet when the Slack notifications started erupting on my phone. The message: “Users report that their account picture has been changed to someone else’s.” I checked the screenshots forwarded by our customer support team. Although grainy, I immediately recognized the person in the picture — it was Alan, one of our Quality Assurance Engineers.

Alan had been performing a routine QA analysis on an unreleased mobile app the day prior. He had chosen a version of the app still in development from the cloud and begun testing. It wasn’t long before the support calls started pouring in. By the hundreds.

A secure experience is crucial for our customers using the Givelify mobile giving app and for us. And while their accounts were never in jeopardy, this required immediate attention.

If you’re a leader in tech, you know the feeling of walking into the office knowing that whatever is on your calendar that day will take a backseat. It doesn’t matter how brilliant your team is. Or how many fail-safes you have in place. Or how experienced you are. Some days, you must be a firefighter.

Thanks to the tenacity of our team, we were able to restore our users’ profile pictures in no time and get to the bottom of the issue. But as soon as the dust settled, we began the most crucial step in figuring out what really happened and how to future-proof our systems: a root cause analysis (RCA).

An RCA, if done right, can expose critical vulnerabilities in your systems and processes. It will help you find areas for improvement as a team. At Givelify, we’ve developed a unique approach to conducting an RCA, and it starts long before an incident calls for one.

In this article, I’ll break down the fundamentals of leading your team through an RCA in a way that strengthens your company culture and bolsters your team’s confidence. And it boils down to one simple idea: blamelessness.

Building a Culture of Blamelessness

When we mess up, it’s perfectly natural for us, as humans, to try to assign blame — whether to a person, a system, or a poor management decision. And often, there is an acute cause. In the case of “Photogate,” as it would come to be called, those reasons were a bad piece of code and poor communication in the team.

But every leader knows that most crises have many causal factors. And when investigating those causes, creating a detailed timeline of events is paramount. Unfortunately, that’s not easy when there’s a sense of fear of repercussions, distrust, or blame within a team — people are naturally apprehensive about coming forth and admitting fault in such an environment. So, unless you’ve established a foundation of blamelessness, it can be challenging to construct a detailed and honest timeline and unearth vulnerabilities.

Creating a culture of blamelessness begins long before any major crisis. As you encounter day-to-day instances of human error, no matter how big or small, ask yourself how the problem can be addressed without blaming your team members. Here are three quick tips on how this can be done:

  • Humility: Creating a culture of blamelessness starts with you as the leader. Talk about what you could have done better. Share similar experiences where you made a mistake that might have caused an issue.
  • Forward-looking: Focus on what needs to change going forward rather than rehashing the mistake(s) that were made. Spotlight the progress made.
  • Reassurance: Highlight the fact that you made it through as a team. You can sometimes imbue crises — no matter how big or small — with humor. This really depends on the crisis.

Over time, you’ll build a sense of mutual trust and shared responsibility in your team. When inevitably a major system failure necessitates a root cause analysis, you’ll find it much easier to get to the bottom of it because you’ve laid the foundations for the expectation of full transparency.

Blameless does not mean there is no accountability for actions. If we do not follow through on improvements, if we act nefariously, if we blatantly disregard processes and procedures, then there will be repercussions. However, a blameless culture is one that does not assume immediate fault, does not punish inadvertent actions, and does not fault the individual for a systemic or process-related failure. It ensures we act on facts rather than emotion.

Focus Your Team on Accurate and Detailed Timelines

Sometimes we may have to kickstart that culture of blamelessness during an RCA. You might need to reinforce this behavior even if you already have a culture of blamelessness. I had to do exactly this at the beginning of our “Photogate” RCA.

We had a lot of stress and apprehension within the team. So, I started with a joke to reduce the tension in the room, and then thanked everyone for jumping in to fix the incident and ensure quick customer satisfaction. I then highlighted the fact that we had resolved the incident and were past the hump. Finally, to keep us from blaming, finger-pointing, or becoming passive participants, I focused the team on one of the most important elements of a good RCA: the timeline of events.

Not only is an accurate timeline critical to a successful RCA, but directing focus to creating it also helps disarm any team members who may feel defensive or apologetic. In other words, the emphasis is on what happened rather than who screwed up. And as you construct that timeline, you’ll notice the discrete points at which things could have gone differently.

Transparency is critical in building an accurate timeline. And with an established culture of blamelessness, you can expect full honesty and collaboration from your team. With that understanding in place, reiterate your expectation that the timeline of events is as detailed as possible. This should be a non-negotiable.

For example, when we conducted the RCA following “Photogate,” we built our timeline using every shred of data we could get our hands on, including:

  • Meeting times in calendars
  • Email threads and conversation timelines
  • Slack messages
  • Pull Requests committed, reviewed, merged
  • Server logs
  • App logs
  • Production deployment times
  • Customer service calls
  • Build times

Each data point revealed an opportunity for improvement. What if Alan had been included in X meeting? What if there had been more communication between the mobile engineer and the cloud engineer? What if there had been clearer protocols for the QA team?

Armed with these data and insights, we built a highly detailed timeline of events down to the millisecond.
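To make the idea concrete, here is a minimal Python sketch of how records from several sources might be merged into one chronological timeline. It is not the tooling we used. The names `TimelineEvent`, `parse_utc`, and `build_timeline` are hypothetical, the sample records are invented placeholders, and the sketch assumes each source can export an ISO 8601 timestamp with an explicit UTC offset alongside a short description.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in the RCA timeline (hypothetical structure)."""
    timestamp: datetime  # timezone-aware, normalized to UTC
    source: str          # e.g. "slack", "server_log", "deploy"
    description: str


def parse_utc(raw: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # Without an offset we cannot order events across systems reliably.
        raise ValueError(f"timestamp has no UTC offset: {raw!r}")
    return parsed.astimezone(timezone.utc)


def build_timeline(sources: dict[str, list[tuple[str, str]]]) -> list[TimelineEvent]:
    """Flatten per-source (timestamp, description) records into one sorted list."""
    events = [
        TimelineEvent(parse_utc(raw_time), source, description)
        for source, records in sources.items()
        for raw_time, description in records
    ]
    return sorted(events, key=lambda event: event.timestamp)


if __name__ == "__main__":
    # Invented placeholder records, not data from the actual incident.
    sample = {
        "slack": [("2023-01-10T09:15:00.250-05:00", "Support reports wrong profile pictures")],
        "server_log": [("2023-01-10T14:14:59.900+00:00", "Profile image write request received")],
    }
    for event in build_timeline(sample):
        print(event.timestamp.isoformat(timespec="milliseconds"), event.source, event.description)
```

Normalizing every timestamp to UTC before sorting is the important design choice here: calendars, chat tools, and servers often report times in different zones, and a timeline is only as trustworthy as its ordering.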

This level of detail allowed us to have an honest conversation around the Five Whys. We pinpointed the exact points at which our systems had failed: Alan, the QA engineer, was not aware of the system’s architecture because he wasn’t brought into some of the necessary conversations. The mobile engineer didn’t know what was actually deployed to the cloud. And the cloud engineer didn’t realize what version of the mobile apps they were building.

Each engineer was able to help us expose the points at which a communication breakdown had occurred, providing valuable information that informed our new set of protocols. And this was only possible due to our team’s sense of shared responsibility for producing an accurate timeline of events and subsequent conversations.

Fostering Transparency Across Teams

As a final step, we held a company-wide meeting to present the findings of our RCA. We opened the meeting to everyone — customer support, marketing, and all of our engineering teams. We encouraged questions and participation. This collaborative approach helped us tie up all the loose ends and put “Photogate” in the rearview mirror.

The meeting also served a secondary function. By holding an open forum, we demonstrated our culture of transparency and instilled an organization-wide understanding of trust, accountability, and open communication. Such a culture is crucial in mitigating technical failures and preventing daily misunderstandings and conflicts that can erode morale.

The fact is, it’s natural for information to become siloed within teams, since we talk more with those we work closely with. But at its core, team-building (and, by extension, company culture) isn’t about holiday parties and retreats. It’s about finding opportunities, wherever possible, to improve communication across teams and departments. And as you build those lines of communication, you’ll establish a solid foundation of companywide trust and transparency.

Final Thoughts

Ultimately, your team’s greatest asset in the wake of a system failure is a culture that allows you to quickly deal with it and then implement the necessary improvements. As leaders, we can get out ahead of any crisis by instilling in our teams an understanding of blamelessness and transparency when things go wrong. And once the dust settles after the next inevitable crisis, the hardest part of an RCA will already be done — the recognition of shared responsibility.

Thankfully, Givelify hasn’t dealt with any more systemwide outages since “Photogate.” Like so many organizations, it took a crisis to improve our processes, strengthen our systems, and bring our teams together. Our customers have all but forgotten about the incident, as reflected by our 5-star app store ratings and continued growth. And as for Alan? Well, some of us still haven’t changed our profile pictures.

--

R.K. Hari Krishna
Givelify Engineering

VP of Technology at Givelify; Electrical Engineer; Tinkerer; Technology with purpose; Advocate for inclusive engineering culture