Handling Incidents at Xendit

This is the first post in a blog series on handling incidents and conducting blameless postmortems at Xendit. In this post, we explain and share how Xendit handles incidents and continuously improves the process.

Wildan S. Nahar
Xendit Engineering
9 min read · Feb 8, 2022


Wildan S. Nahar with Luminto Luhur and Theo Mitsutama

Illustration by Rizky Radian

Background

Xendit’s mission is to make payments easy for businesses and to build digital finance infrastructure in Southeast Asia. Xendit has scaled to become the leading payment gateway in Southeast Asia by offering valuable products such as API integrations, libraries/SDKs, a dashboard UI/UX, and plugins for platforms like Shopify and WooCommerce.

When we were much smaller a few years back, speed of iteration and development was vital to our survival, and we managed to onboard key customers to supercharge our growth. However, as we onboarded more and more customers, we started to see things not working as expected and heard reliability concerns directly from our key customers. Problems arose and became widespread enough to affect the majority of our customers. And because there was no procedure for incident handling, most employees got involved in every incident, which made the recovery process cumbersome.

To combat this, we built a standardized incident response process to better serve our customers during periods of downtime. It also gives everyone else in the organization peace of mind that someone is always available to handle incidents.

Incident Types

As with every tech company, there is a myriad of root causes. These include, but are not limited to, self-inflicted incidents (application logic errors, misconfigured infrastructure) and our partners’ problems (unreliable payment webhooks, timeout issues, unscheduled maintenance, and many more).

Most partner problems are not fully within our control, so we built contingency plans to handle such cases:

  • Setting up monitoring systems and SOPs
  • Setting up tracking mechanisms to provide data to partners during QBRs (Quarterly Business Reviews) and ask them to prioritize system improvements
  • Building automation based on patterns we found
  • Building a circuit breaker mechanism to react faster to partner system failures (a minimal sketch follows this list)
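
To make the last bullet concrete, here is a minimal sketch of a consecutive-failure circuit breaker. It is not Xendit’s actual implementation; the package name, failure threshold, and cooldown are assumptions for illustration.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker refuses calls to the partner.
var ErrOpen = errors.New("circuit open: partner calls suspended")

// Breaker trips after maxFailures consecutive errors and stays open for cooldown.
// Illustrative sketch only, not Xendit's implementation.
type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

func New(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn against the partner unless the breaker is open.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // fail fast instead of waiting on a degraded partner
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open the breaker
		}
		return err
	}
	b.failures = 0 // partner healthy again, close the breaker
	return nil
}
```

Once the cooldown elapses, the next call is let through as a probe: if it fails, the breaker reopens; if it succeeds, normal traffic resumes.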

Self-inflicted incidents, on the other hand, are fully under our control and can be prevented from happening again in the future. This article describes how Xendit handles incidents across multiple partners and internal teams and ensures every incident is communicated well and handled in a timely manner. There are also measurements of how successful our incident handling is, which we talk about more below.

Roles and Responsibilities

Defining roles and responsibilities is important so everyone knows what they need to do in the event of an incident. Because we had handled incidents before, it was natural for us to observe what each role had contributed to resolving past incidents; we just needed to document these roles and responsibilities so current and future employees can easily refer to them.

To give you an idea of what this looks like at Xendit, we have multiple roles that are involved in a fire (we call incidents “fires”):

  • Fire Warden (referred to as FW from here on): The main person who coordinates incident resolution across teams and provides concise communication to internal and external stakeholders
  • Engineers: Investigate and diagnose technical problems and work to drive recovery and resolution
  • Customer Success and Account Managers: Communicate updates and the resolution of the incident to customers and assure them that we are actively working on the incident
  • Product Managers: Bridge communication between stakeholders, especially to communicate product impact to stakeholders

The FW role is one of the keys to the success of Xendit’s incident response. To ensure the quality of incident handling, the FW has additional responsibilities:

  1. Be available 24/7.
  2. Acknowledge an incident or a potential incident raised by customers or internal teams.
  3. Set up a war room and gather relevant people/teams to identify the impact on customers.
  4. Communicate the impact and status of the incident effectively and periodically to customers through the Status page.
  5. Drive recovery and reconciliation of the incident as part of incident resolution.
  6. Assign relevant people to conduct the postmortem, then contribute to and review it.

In terms of people allocation, we do not have a dedicated team or person focused on the FW role. Instead, we select people who have a great track record of clear communication and have continuously proven themselves able to make quality decisions autonomously. At the time of writing, we have 25+ people in the FW role, drawn from Leadership, Product Managers, and Engineering Managers, rotating the FW shift every week.

Now that we have defined the roles and responsibilities, let’s look at how each role functions and what the process looks like.

Incident Handling

When an incident happens, the FW coordinates with teams and stakeholders to identify and resolve it. Below is a “simplified” version of the incident response flow at Xendit.

Here are a few explanations of the steps above, followed by each role’s responsibilities during the incident.

Escalation

Generally, there are several types of escalation. From the product engineer’s perspective, they rely on the product’s monitoring system (with thresholds, metrics, and synthetic tests configured), which observes the health of the system and immediately notifies the on-call engineer should any problem arise. On the other side, customer service or account managers also have the “right” to escalate possible incidents if customers bring up recurring or increasingly frequent problems.
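
As an illustration only (not Xendit’s production setup), a synthetic check that pages the on-call engineer after a few consecutive failures could look like the sketch below; the endpoint, interval, thresholds, and paging hook are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Hypothetical values; real thresholds come from each product's alerting config.
const (
	endpoint      = "https://api.example.com/health" // synthetic-test target
	latencyBudget = 2 * time.Second
	failBudget    = 3 // consecutive failures before paging on-call
)

func pageOnCall(reason string) {
	// In practice this would call the paging tool's API or an incident webhook.
	fmt.Println("PAGE on-call engineer:", reason)
}

func main() {
	client := &http.Client{Timeout: latencyBudget}
	failures := 0

	for range time.Tick(30 * time.Second) { // run the synthetic test periodically
		start := time.Now()
		resp, err := client.Get(endpoint)
		healthy := err == nil && resp.StatusCode == http.StatusOK && time.Since(start) < latencyBudget
		if resp != nil {
			resp.Body.Close()
		}

		if healthy {
			failures = 0
			continue
		}
		failures++
		if failures >= failBudget {
			pageOnCall(fmt.Sprintf("synthetic check failed %d times in a row", failures))
			failures = 0 // avoid re-paging every cycle for the same issue
		}
	}
}
```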

After that, together with the on-call engineer, the FW acknowledges the problem and verifies whether it is a widespread problem or an isolated one (affecting only a single customer). If it is widespread, the FW can escalate, declare the incident a fire, and update the status page. At this stage, it is necessary to coordinate thoroughly with the product engineer to “probe” important data such as the products/features affected (APIs, UIs, etc.). With that information, the FW can update the status page with the details so customers can refer to them and adjust their business operations accordingly.

Recovery

At this stage, every action should be focused on resolving the incident as fast as possible to get the system back to normal. The FW should ask the product engineer for updates on the observable problem and the suspected cause, and for frequent updates on the progress of the resolution. If necessary, the FW can also ask for help from an infrastructure engineer, who has access to the infrastructure, so product engineers can resolve the incident efficiently. Clear, direct, and frequent communication is essential during this stage to drive investigation and resolution.

Since we have quite a lot of experience with incidents, we can ask the engineer to apply fixes based on previous cases, such as reverting to the previous app version (if a deployment prior to the incident is suspected of causing it), restarting the server, or upgrading the server’s resources. But we also make sure that every action taken is based on data; there are no wild guesses at symptoms or causes at this critical moment. Since we also depend heavily on third parties (banks, e-wallets, etc.), we first check whether a dependency is the problem so we can escalate to its PIC.

After the root cause has been found and a fix has been implemented, we monitor for at least 15 minutes to check whether the fix actually solves the problem. At this point, we update our status page to MONITORING so our customers know the latest progress. We mark the status page as RESOLVED when all of the affected systems are operating normally again after the monitoring period.

Reconciliation

Reconciliation is needed if broken processes caused certain transactions or business processes to not work as expected, requiring intervention to complete those transactions. The objectives are (a simplified consistency check is sketched after the list below):

  • Ensuring the data state is fully consistent internally and externally
  • Customers know how to reconcile and correct their systems
  • Customers can be informed whether they are safe to retry the requests
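
As a minimal sketch of the first objective, assume both sides can be reduced to a map of transaction IDs to terminal states; the types and package below are illustrative, not Xendit’s actual reconciliation tooling.

```go
package recon

// TxnState is the terminal state we and the partner each recorded for a transaction.
type TxnState string

// Mismatch describes a transaction whose internal and partner states disagree.
type Mismatch struct {
	TxnID    string
	Internal TxnState
	Partner  TxnState
}

// Reconcile compares our records against the partner's report and returns every
// transaction that needs manual or automated correction. Missing entries are
// reported with an empty state.
func Reconcile(internal, partner map[string]TxnState) []Mismatch {
	var out []Mismatch
	for id, ours := range internal {
		if theirs, ok := partner[id]; !ok || theirs != ours {
			out = append(out, Mismatch{TxnID: id, Internal: ours, Partner: theirs})
		}
	}
	// Transactions the partner has but we never recorded.
	for id, theirs := range partner {
		if _, ok := internal[id]; !ok {
			out = append(out, Mismatch{TxnID: id, Internal: "", Partner: theirs})
		}
	}
	return out
}
```

The resulting mismatch list is what drives the follow-up: correcting our own state, telling customers how to reconcile theirs, and confirming which requests are safe to retry.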

Insights

We have covered how Xendit handles incidents. But is there any proof that the above methods are effective? This is where the data tells us.

Result

MTTR (Mean Time to Recovery)

MTTR for the past 2 years (in minutes)

MTTR, at a generic level, is defined as the average time it takes to repair a product or system failure. This covers the full outage, from the time the system or product fails to the time it operates normally again. The graph above shows MTTR in minutes on a quarterly basis across two timelines: the first (left to middle) covers January to December 2020, and the second (middle to right) covers January to December 2021. The snapshots from those ranges show that the average MTTR for the first period was around 279 minutes, while the second period was faster at around 170 minutes. That tells us that Xendit’s recovery procedure for incident handling is improving over time. It shows that our focus on recovering first is effective, while we keep improving reliability to make sure the number of incidents decreases and our system performs at the highest level.
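
For reference, MTTR here is just the arithmetic mean of (recovery time minus failure time) over the incidents in a period. A small sketch with made-up timestamps:

```go
package main

import (
	"fmt"
	"time"
)

// Incident holds the two timestamps MTTR needs: when the failure started and
// when the system was operating normally again.
type Incident struct {
	FailedAt    time.Time
	RecoveredAt time.Time
}

// MTTR returns the mean time to recovery in minutes for a set of incidents.
func MTTR(incidents []Incident) float64 {
	if len(incidents) == 0 {
		return 0
	}
	var total time.Duration
	for _, inc := range incidents {
		total += inc.RecoveredAt.Sub(inc.FailedAt)
	}
	return total.Minutes() / float64(len(incidents))
}

func main() {
	t := func(s string) time.Time { // helper; timestamps below are made up
		ts, _ := time.Parse(time.RFC3339, s)
		return ts
	}
	quarter := []Incident{
		{FailedAt: t("2021-07-01T02:00:00Z"), RecoveredAt: t("2021-07-01T04:10:00Z")}, // 130 min
		{FailedAt: t("2021-08-15T09:30:00Z"), RecoveredAt: t("2021-08-15T12:20:00Z")}, // 170 min
	}
	fmt.Printf("MTTR: %.0f minutes\n", MTTR(quarter)) // -> 150 minutes
}
```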

% detected internally first

Xendit Detect First vs Customer Detect First

Another interesting graph to look at is the percentage of incidents we detected before customers raised them. It means we can detect and acknowledge the incident and let customers know that we are aware of it and continuously working to resolve it. The graph above shows that, in the given periods, we detected most incidents internally before customers were aware of them. It also shows that our detection rate increased on average from around 73% (May-September 2020) to around 86% (May-September 2021). That shows our monitoring system is working well and our on-call engineers act quickly when an alert flags a potential incident in our system. This is super important, as customers always get the latest information about our system’s health without needing to reach out to us.

Improvements

We have also made some small improvements to our incident handling process in certain areas. Here are a few of them:

Postmortem timeline generator: generates the timeline (when, who, what) to ease filling in the postmortem doc so we don’t need to do it manually.
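
A hypothetical version of such a generator only needs to sort “when, who, what” events chronologically and render them as rows; the event source and fields below are assumptions, not the actual tool.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Event is one "when, who, what" entry pulled from the war room (e.g. chat
// messages, status page updates, deploy logs); the source here is hypothetical.
type Event struct {
	When time.Time
	Who  string
	What string
}

// Timeline sorts events chronologically and renders postmortem-ready rows.
func Timeline(events []Event) []string {
	sort.Slice(events, func(i, j int) bool { return events[i].When.Before(events[j].When) })
	rows := make([]string, 0, len(events))
	for _, e := range events {
		rows = append(rows, fmt.Sprintf("%s | %s | %s", e.When.Format("15:04"), e.Who, e.What))
	}
	return rows
}

func main() {
	events := []Event{
		{When: time.Date(2021, 9, 1, 10, 42, 0, 0, time.UTC), Who: "FW", What: "Declared fire, opened war room"},
		{When: time.Date(2021, 9, 1, 10, 35, 0, 0, time.UTC), Who: "On-call engineer", What: "Acknowledged latency alert"},
	}
	for _, row := range Timeline(events) {
		fmt.Println(row)
	}
}
```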

Backup fire warden: acts as a backup and as a second responder if two incidents happen concurrently.

Incident page status reminder: periodically reminds us if any open incident has not yet been resolved.

Summary

At the end of the day, our goal for incident handling is to recover more quickly so that customers can continue operating their business without worry. We always strive for system reliability and stability, and we put our customers’ interests first before anything else. Here are several key takeaways that we think are worth mentioning.

  • Defining roles and responsibilities for incident handling helps us significantly to “orchestrate” the process, manage the impact, inform customers, and ultimately resolve the incident as fast as possible. This is reflected in the improvement in recovery time over the last year compared to several years ago, when we didn’t have such a clear distinction.
  • Formalizing training for fire warden candidates is also a good way to speed up knowledge transfer. We have also created a fire warden handbook, a living document that contains knowledge in both theory and practice so that every fire warden can better understand the incident handling process.
  • Improving the incident handling process is a lifelong journey. There are always more ways to reduce recovery time, detect incidents earlier, and minimize the impact on our customers. We have learned that there are lots of factors to consider when we intend to improve this process: people, systems, third parties, partners, and so on. That being said, we need to be careful when introducing new processes, because the stakes are high and the risk is even higher at this level, so every new idea should be “groomed” well, executed even better, and always evaluated.

In the second post of the series, we’ll talk about how we conduct a blameless postmortem as the process that follows an incident.
