How We Keep Our Government Apps Running With High Reliability: A Peek at Our Incident Management Strategy

GovTech Edu

Get to know Nadinastiti, our Technical Program Manager, and Estu Fardani, our Cloud Platform Engineer at GovTech Edu, as they shed light on incident management best practices in GovTech Edu’s engineering teams. Drawing on years of experience, they share the insights that keep GovTech Edu’s incident management process efficient and effective.

Nowadays, people rely on mobile and desktop apps for everyday needs such as e-commerce, ride-hailing, and food delivery. Issues like declined payments, longer navigation routes, and unsent messages are things people sometimes experience. Public sector apps are no exception: many government institutions run mobile and desktop apps to serve the public, and most people have probably used at least one of them. With Indonesia’s large user base, these kinds of issues are unavoidable, and complaints about public sector apps are not a rare sight on social media.

If we are lucky, those issues are resolved within an hour. However, it is also common for technical issues in government apps to persist for months or even years. Why do resolution times vary so widely? Is it possible to ensure that every technical issue is resolved as fast as possible, and that teams learn how to prevent the same mistakes in the future?

System Reliability Must Be Prioritized

As the Indonesian government’s thought and development partner, GovTech Edu has built many different platforms to enable inclusive access to education:

  • Quality learning content for teachers (Merdeka Mengajar app)
  • Data-driven decision-making in schools (Rapor Pendidikan)
  • A collaboration between Universities, Industry Partners, & Graduates (Kampus Merdeka)
  • School procurement system (ARKAS and SIPLah)
  • Single sign-on account that provides access to various educational platforms (belajar.id)

Those platforms are used by at least 500,000 teachers and 104,000 schools. Belajar.id accounts are used by 13 million students, teachers, principals, school operators, and education administrators. It is clear that GovTech Edu must establish an incident response workflow to ensure smooth access for the platforms’ huge user base when emergencies happen.

The incident response workflow started rolling out at GovTech Edu in September 2022. Based on our latest data, the GovTech Edu team’s Mean Time To Recover (MTTR) for incidents that impact users directly (Sev-1, Sev-2, Sev-3) was around 3 hours and 52 minutes for September-December 2022. For January and February 2023, more than four months after the workflow was rolled out, the MTTR was 1 hour 28 minutes, which is a more than 50% improvement!
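
As a rough illustration of how this metric can be computed, here is a minimal Python sketch. The incident records below are hypothetical; in practice we pull these numbers from our incident tracker rather than hard-coding them.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (severity, detected_at, recovered_at).
incidents = [
    ("Sev-2", datetime(2023, 1, 5, 9, 10), datetime(2023, 1, 5, 10, 25)),
    ("Sev-1", datetime(2023, 1, 19, 14, 0), datetime(2023, 1, 19, 16, 5)),
    ("Sev-3", datetime(2023, 2, 7, 8, 30), datetime(2023, 2, 7, 9, 15)),
    ("Sev-5", datetime(2023, 2, 20, 11, 0), datetime(2023, 2, 21, 11, 0)),
]

# MTTR only counts incidents that impact users directly (Sev-1 to Sev-3).
durations = [
    recovered - detected
    for severity, detected, recovered in incidents
    if severity in ("Sev-1", "Sev-2", "Sev-3")
]
mttr = sum(durations, timedelta()) / len(durations)
print(f"MTTR: {mttr}")  # 1:21:40 for these sample records
```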

This shows that it is possible for government teams to solve technical issues quickly and efficiently, and it shows how incident management helps teams handle and prevent incidents. How do we do it?

Putting Users at the Center of Our Work

Put User First, one of the values we believe in at GovTech Edu

Managerial support is a must-have for a working incident response flow. A lack of support from those in charge can lead to a lack of motivation and direction, including in implementing an incident workflow. GovTech Edu wants to #PutUserFirst, which means users are the center of our work. To live up to that value, we prioritize platform and system reliability to ensure a seamless user experience, even when the unexpected happens.

Making a safe space to make mistakes

In addition, we also need assurance that the organization is a safe space to make mistakes. By creating an environment of openness and accountability, no one will hide their mishaps out of self-preservation, which could worsen an incident’s severity. One way to create that safe space is to build a blameless culture, including in postmortem reviews.

On-call rotations

Since we have a huge user base, our platforms must be accessible at any time. Thus, we need to adjust our workflow to guarantee reliability just like our counterparts in the private sector (e-commerce, ride-hailing), even during weekends and holidays. One of those adjustments is on-call rotations. With on-call rotations, the team can ensure that internal and external system disruptions are handled as soon as possible. How do we do that?

We start by creating an on-call schedule. Using alerting systems like PagerDuty or OpsGenie, we can create automatic on-call schedules integrated with monitoring systems such as Grafana and Datadog. If anything happens to the metrics we have set, the systems automatically notify the on-call engineer, as the first responder, through multiple channels: phone calls, SMS, push notifications on smartphones, email, and group chats. The system will also escalate the alert up the hierarchy if needed.
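
The exact rules live inside the alerting tool, but the escalation behavior described above can be sketched as plain data plus a loop. Everything in this sketch (the channel names, waiting times, and helper functions) is hypothetical and only illustrates the idea; it is not the configuration of any specific tool.

```python
import time

# Hypothetical escalation policy: who to page, through which channels,
# and how long to wait before escalating to the next level.
ESCALATION_POLICY = [
    {"target": "primary on-call", "channels": ["call", "sms", "push", "email", "group chat"], "wait_minutes": 5},
    {"target": "secondary on-call", "channels": ["call", "sms", "push"], "wait_minutes": 10},
    {"target": "engineering manager", "channels": ["call"], "wait_minutes": 15},
]

def notify(target: str, channels: list[str], alert: str) -> None:
    # Placeholder: a real setup would call the alerting tool here.
    print(f"Paging {target} via {', '.join(channels)}: {alert}")

def acknowledged() -> bool:
    # Placeholder: a real setup would check whether the page was acknowledged.
    return False

def page_on_call(alert: str) -> None:
    """Walk the escalation policy until someone acknowledges the alert."""
    for level in ESCALATION_POLICY:
        notify(level["target"], level["channels"], alert)
        time.sleep(level["wait_minutes"] * 60)
        if acknowledged():
            return
```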

It depends on each team, but the on-call engineer(s) should be responsible for handling emergencies, short-term ad-hoc tasks, and/or routine team activities. They are expected to take on fewer tasks than usual so that they can focus on being on call. Each on-call session lasts one week (24 hours x 7 days) with at least one engineer. To prevent burnout, a team should have enough engineers that each one gets enough time to rest before their next on-call session. At the end of each session, the on-call engineer writes a handover report so that the next on-call engineer can continue the work seamlessly.
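
As a rough sketch of how such a weekly rotation could be generated (in reality the alerting tool manages this for us), assuming a made-up roster and start date:

```python
from datetime import date, timedelta
from itertools import cycle

# Hypothetical team roster; with four engineers, each gets three weeks
# of rest between one-week on-call shifts.
engineers = ["Engineer A", "Engineer B", "Engineer C", "Engineer D"]

def weekly_rotation(start: date, weeks: int):
    """Yield (week_start, week_end, engineer) for each one-week shift."""
    roster = cycle(engineers)
    for week in range(weeks):
        week_start = start + timedelta(weeks=week)
        yield week_start, week_start + timedelta(days=6), next(roster)

for begin, end, engineer in weekly_rotation(date(2023, 1, 2), 4):
    print(f"{begin} - {end}: {engineer}")
```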

Incident management workflow in GovTech Edu

Why do we need incident management?

Before the incident management workflow was rolled out, production incidents in GovTech Edu were handled in silos within each squad or tribe. Incidents were only reported to all teams once a team needed help from the infrastructure team. There was no official way to report incident lead-ups, and postmortems were not required. Moreover, in each incident war room, no official role was responsible for reporting and resolving the incident.

When incidents are not reported to all teams, changes made during those incidents are not properly acknowledged, captured, or documented. Lessons learned and knowledge acquired are lost, and we might repeat the same incidents. This is why GovTech Edu needs proper incident management: to turn the chaos of an incident into a swift, organized resolution.

Incident management workflow

We now have a working incident workflow comprising three steps: incident identification, incident response and remediation, and incident analysis. In incident identification, we focus on encouraging reporting habits. We provide a Slack workflow to report any lead-ups so that all engineering teams are aware of a possible incident. In this phase, we also focus on identifying whether the report is a valid incident.
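
Our actual reporting flow is built with Slack’s workflow features, but the spirit of a low-friction lead-up report can be sketched with a plain incoming-webhook message. The webhook URL and report fields below are placeholders.

```python
import requests

# Placeholder URL for a Slack incoming webhook pointing at a shared incident channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def report_incident_lead_up(reporter: str, product: str, symptom: str) -> None:
    """Post a lead-up report so all engineering teams are aware of a possible incident."""
    message = (
        f":rotating_light: Possible incident reported by {reporter}\n"
        f"Product: {product}\n"
        f"Symptom: {symptom}\n"
        "Responders, please confirm whether this is a valid incident."
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

report_incident_lead_up("on-call engineer", "belajar.id", "login error rate spiking")
```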

Once we decide that an incident is happening, we focus on stopping further damage or loss of service, resolving the source of the issue, and recovering the user journey. This means we strive to restore our users’ journey so that they can continue business as usual as soon as possible. We also attend to repairing any resulting damage. Following this, we decide which severity level the incident falls into. We use five categories of severity, where Sev-1 is for the most critical incidents (such as those impacting more than 3% of the product line’s daily active users) and Sev-5 is for minor incidents that only impact a small number of users.
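
The thresholds differ per product line, but the severity decision can be sketched roughly as follows. Only the 3% DAU rule for Sev-1 comes from the guideline above; the other cut-offs in this sketch are illustrative placeholders.

```python
def classify_severity(impacted_users: int, daily_active_users: int) -> str:
    """Rough severity classification based on the share of DAU impacted.

    Only the 3% DAU rule for Sev-1 reflects our guideline; the other
    cut-offs are illustrative placeholders.
    """
    impacted_share = impacted_users / daily_active_users
    if impacted_share > 0.03:
        return "Sev-1"   # most critical: more than 3% of the product line's DAU
    if impacted_share > 0.01:
        return "Sev-2"
    if impacted_share > 0.001:
        return "Sev-3"
    if impacted_share > 0.0001:
        return "Sev-4"
    return "Sev-5"       # minor: only a small number of users impacted

print(classify_severity(impacted_users=50_000, daily_active_users=1_000_000))  # Sev-1
```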

We must ensure the incident will not happen again. That is why we uncover the root cause and document it in a postmortem. By writing it down, all teams can read the lessons learned. We also decide on actionable prevention steps in the postmortem. For Sev-1, Sev-2, and Sev-3 incidents, the postmortem is reviewed so that relevant stakeholders acknowledge its quality and action items.

Postmortem and its benefits

As mentioned above, the postmortem is a safe space because it is blameless. This means no one is blamed for their errors. We refer to people by their roles, not their names, in any written document. We also ask “why did the system allow them to do this, or lead them to believe this was the right thing to do” instead of “why did individual X do this”. Then, we focus on the follow-up steps to prevent the incident from happening again.

By documenting postmortems, internal teams can also learn from other teams’ incidents and apply the lessons to their own work. We can also spot common root causes at the organizational level and make decisions at a higher level.

Postmortem details

Our postmortem documents contain three main parts: summary, report, and post-incident notes. Below are the explanations for each part.

Postmortem Summary

This part summarizes the incident to facilitate a quick and comprehensive understanding of what occurred. We usually write it as a single paragraph of 3–4 sentences covering the key information about the incident.

Postmortem report

We write all the details about the incident, such as:

  1. Detection: how the incident was noticed; it can come from monitoring dashboards and alerts or from user reports.
  2. Trigger: the unexpected system behavior that set off the incident.
  3. Impact: what users could not do because of the incident.
  4. Resolution/Recovery: the actions taken when the incident happened, including those that restored the impacted service to its pre-incident state.
  5. Root cause: the real reason the incident happened.

We also capture the timeline: a time series of events and actions before the incident, when the incident happened, and in the war room.

Post-incident notes

After detailing what happened in the incident, we write these points to complete our postmortem:

  1. “What went well?” explains what we already had in place that worked well and might have kept the incident from growing bigger.
  2. “What did not go well?” explains what went wrong or deviated from normal operations.
  3. Follow-up action items are a list of to-dos to prevent similar incidents from happening again.
  4. “Lessons learned” lists what we learned from this incident and what might be applied to other products or teams.
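
To show how the parts above fit together, here is a minimal sketch of a postmortem document as a data structure. The field names mirror the sections described above; our actual template is a written document, so a code representation like this is only an illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TimelineEntry:
    time: str    # e.g. "2023-02-07 08:30 WIB"
    event: str   # what happened or what action was taken

@dataclass
class Postmortem:
    # Summary: a single paragraph of 3-4 sentences describing the incident.
    summary: str
    # Report: the details of the incident.
    detection: str               # monitoring alert, dashboard, or user report
    trigger: str                 # the unexpected system behavior
    impact: str                  # what users could not do
    resolution: str              # actions taken to stop the damage and recover
    root_cause: str              # the real reason the incident happened
    timeline: list[TimelineEntry] = field(default_factory=list)
    # Post-incident notes.
    what_went_well: list[str] = field(default_factory=list)
    what_did_not_go_well: list[str] = field(default_factory=list)
    follow_up_action_items: list[str] = field(default_factory=list)
    lessons_learned: list[str] = field(default_factory=list)
```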

Wrap-up Statement

To sum up, incident management is a crucial process for maintaining system reliability. We can see that this workflow improved GovTech Edu’s mean time to recover between 2022 and 2023. Management support for prioritizing system reliability is necessary to ensure the workflow is implemented across the whole organization.

This incident management workflow can be reproduced freely, especially in government institutions’ systems that provide public services.

Authors’ Bios

Nadinastiti, Technical Program Manager at GovTech Edu. Having been involved in incident management at previous companies, she now leads the implementation and adoption of this critical workflow for the engineering team at GovTech Edu.

Estu Fardani, Cloud Platform Engineer at GovTech Edu. He is part of the cloud operations and security teams and joins the on-call rotation. Estu has worked with cloud technologies since 2015. He contributes to open-source projects such as BlankOn and openSUSE and often speaks at local and international open-source community events.
