How a government app reached a 99.97% crash-free rate for 1.3M+ MAUs

Krisna Dibyo A
GovTech Edu
7 min read · Dec 5, 2022

PMM (Platform Merdeka Mengajar) is a teacher super app built by GovTech Edu. It has many features to support teachers’ teaching practice and improve its quality. At the time of writing, we have 2+ million downloads, hundreds of thousands of DAU, a tiny download size (4MB), a 4.7+ rating on the Play Store from 60K+ reviews, and a 99.97% daily crash-free users rate. We have 11 Android engineers who all work remotely from different cities nationwide. You can read more about PMM here.

All those good numbers are just “numbers”; the real question is, how did we do it? In this post, we want to share some practices we follow at GovTech Edu that keep this app at the highest level, specifically on crash-free users. The crash-free rate is one of the most critical performance metrics, showing how stably your app performs. A 99.97% crash-free rate is impressive for a super app with more than 1.3 million MAUs, especially considering that the standard benchmark in the tech industry is around 99.00%. Ok, enough with the chit-chat ☺️ Let’s jump to the first practice below.
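To make the metric itself concrete, crash-free users is simply the share of active users who experienced no crash in a period. A minimal sketch (the function name and numbers are illustrative, not taken from our dashboards):

```kotlin
// Crash-free users rate: share of active users who hit no crash in the period.
fun crashFreeRate(activeUsers: Long, usersWithCrash: Long): Double {
    require(activeUsers > 0 && usersWithCrash in 0..activeUsers)
    return 100.0 * (activeUsers - usersWithCrash) / activeUsers
}
```

At 1.3M MAUs, a 99.97% rate means that roughly only 390 users experienced any crash at all, which is why even small regressions show up clearly in this metric.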

Test Your Code

Photo by Arnold Francisca on Unsplash

Well, this sounds cliché (and it is not that simple), but it is true, even if the correlation is indirect. When you test your code, you have to rethink the logic, which prevents you from making a silly bug (that may trigger a crash). Our engineers at GovTech Edu take this seriously. We agreed to strictly write unit tests, e2e tests, and UI tests (the test pyramid). We have a pipeline that checks when coverage goes down, and we also committed to a certain number as minimum coverage. The QA platform team supports this ecosystem by building the testing tools and pipelines.
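As a toy illustration of how a unit test catches a crash-prone path early (the deep-link format and function are hypothetical, not PMM code): a nullable payload parsed without care is a classic NullPointerException source, and a few assertions pin down the edge cases.

```kotlin
// Hypothetical parser for a notification deep link. Writing it null-safely,
// and testing it, avoids the kind of NPE that would crash in production.
fun parseFeatureId(deepLink: String?): Int? =
    deepLink
        ?.substringAfterLast("/feature/", missingDelimiterValue = "")
        ?.toIntOrNull()

// Unit tests cover the happy path plus the malformed and null payloads.
fun runParserTests() {
    check(parseFeatureId("pmm://app/feature/42") == 42)
    check(parseFeatureId("pmm://app/home") == null) // no feature segment
    check(parseFeatureId(null) == null)             // null payload
}
```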

Toggle Everything

Photo by Steve Johnson on Unsplash

As we grow and develop more and more features, sometimes things just break in production. No matter how much test coverage you have, sh*t happens; it can be triggered by a faulty server response or a bad notification payload, you name it. That’s why we put a toggle in almost all of our features. Mobile apps have a problematic adoption rate (unlike the web): you can’t just ship an update and hope everyone installs the latest version. Whenever we detect an unusual crash rate in a specific feature, we toggle it off while we fix the problem. Most of the time, it doesn’t require a new app release, as the fix lands on the backend side, so we can toggle the feature back on after the fix is deployed. Firebase Remote Config is a good free toggle tool. It is a simple key-value store and is customizable with specific filters like app version and build type.
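The toggle pattern can be sketched like this. In production the flag values would come from a fetched Remote Config (its `getBoolean(key)` lookup); here an in-memory map stands in so the sketch is self-contained, and all names are illustrative.

```kotlin
// Feature-toggle pattern: every risky feature checks a remotely controlled
// flag before rendering. The interface lets the real Remote Config backend
// be swapped in without touching feature code.
interface FeatureToggles {
    fun isEnabled(key: String): Boolean
}

class InMemoryToggles(private val flags: Map<String, Boolean>) : FeatureToggles {
    // Unknown keys default to OFF: a missing flag should fail safe.
    override fun isEnabled(key: String): Boolean = flags[key] ?: false
}

fun renderHome(toggles: FeatureToggles): List<String> {
    val sections = mutableListOf("core_content") // never toggled off
    if (toggles.isEnabled("new_video_player")) sections.add("new_video_player")
    return sections
}
```

Flipping `new_video_player` to `false` in the remote config then removes the feature for every installed version, with no release and no waiting on adoption.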

Apps on-Call

Photo by Daniel Tausis on Unsplash

Remember the saying, “When everyone’s responsible, no one’s responsible.” It is true, at least for our team at GovTech Edu. We agreed to schedule an on-call person each week dedicated to monitoring, fast response, and any other ops work. At first, we thought this was unnecessary, since apps don’t release every day or even every week. But the triggering factor doesn’t only come from the app itself; it can also be external, like a service deployment, a toggle change, or a provisioning update that directly affects the app.

When a crash occurs, the on-call is the first person to respond, deciding on the fixing strategy or asking others for help. On-call engineers are not robots; they need tooling to keep up. We have a visible, public alert posted to our Slack channel, so it quickly notifies the on-call. A simple Firebase alert with a Slack webhook is what we use. We know there are better alerting options like Opsgenie, but this is good enough for our team. The next step is to tune the alert. You have to avoid making it a false alarm that everyone eventually ignores. How you tune it depends on your app and team: how many users you have, what your current crash-free average is, and whether there is a recent crash spike you are worried about. Think about it, discuss it with all related members, and you’ll find the right tune.
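One way to think about that tuning is as a simple rule: alert on a meaningful drop from your own baseline, plus a hard floor you never want to cross. The thresholds below are illustrative, not our production values.

```kotlin
// Alert-tuning sketch: fire only when the crash-free rate drops meaningfully
// below the recent baseline, so on-call isn't trained to ignore false alarms.
fun shouldAlert(
    currentCrashFreeRate: Double,  // today's reading, in percent, e.g. 99.91
    baselineCrashFreeRate: Double, // rolling average, e.g. 99.97
    minDrop: Double = 0.05,        // ignore noise smaller than this
    hardFloor: Double = 99.0       // always alert below the industry benchmark
): Boolean =
    currentCrashFreeRate < hardFloor ||
        (baselineCrashFreeRate - currentCrashFreeRate) >= minDrop
```

The same predicate can drive the Slack webhook call, so the channel only lights up when a human should actually look.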

Be picky about choosing a library

Photo by Paul Melki on Unsplash

Errors in libraries are sometimes inevitable. Sometimes you think your code is perfect, you passed the tests, and then suddenly it crashes inside library code. We have encountered that many times, so we must be careful about picking which library is better than the others. First, check the number of open issues; if the number is relatively high, we should avoid it. Second, prefer a Kotlin library over a Java one; we found that stability improved after migrating Java libraries to their Kotlin versions. Third, find a library with an excellent reputation based on the number of contributors and the last update; we should avoid libraries that have been inactive for more than 2–3 years, as they are potentially out of sync with the latest OS updates. You may weigh these three criteria separately, but you can discuss them with your team or ask the Android community for further analysis.
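Those screening criteria can be written down as a simple checklist your team agrees on. The thresholds here are illustrative placeholders; pick your own numbers in that discussion.

```kotlin
// Library-screening checklist from the criteria above. All thresholds are
// illustrative; adjust them with your team before using them as a gate.
data class LibraryInfo(
    val openIssues: Int,
    val contributors: Int,
    val monthsSinceLastUpdate: Int
)

fun passesScreening(lib: LibraryInfo): Boolean =
    lib.openIssues < 500 &&          // not drowning in unresolved issues
        lib.contributors >= 10 &&    // a healthy contributor community
        lib.monthsSinceLastUpdate <= 24 // active within the last ~2 years
```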

Rollout strategy

Photo by SpaceX on Unsplash

We are fortunate that the Google Play Store has a staged rollout feature, and we highly recommend releasing apps with it. It is a dead-simple defensive procedure against potentially big crashes. We usually start a rollout at below 10%. Why this number? Because it gets an early sample without impacting too many users. One or two days after the release, we can see the crash trends. At this stage, you have to answer these questions: Have any new crashes been found? Are these crashes potentially widespread? Do we have to release a hotfix for this recent crash? These questions should be decided together by managers, QAs, and engineers. The trade-off is always a longer release time, so make sure your hotfix is worth it. You want to avoid fixing a minor crash and missing the target launch, which may impact your company’s business.
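The decision loop behind a staged rollout can be sketched as: expand to the next stage only while the crash-free rate stays healthy, and hold the moment it dips. The stage ladder and the 99.9 threshold below are illustrative, not our actual release policy.

```kotlin
// Staged-rollout sketch: widen the audience only while crashes stay healthy;
// hold (and consider a hotfix) the moment the rate dips below the threshold.
val rolloutStages = listOf(5, 10, 25, 50, 100) // percent of users

fun nextStage(currentPercent: Int, crashFreeRate: Double, threshold: Double = 99.9): Int =
    if (crashFreeRate >= threshold)
        rolloutStages.firstOrNull { it > currentPercent } ?: 100
    else
        currentPercent // hold the rollout; investigate before expanding
```

The actual expansion is still a manual click in the Play Console; the value of writing the rule down is that managers, QAs, and engineers argue about the thresholds once, not at every release.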

Multiple OS and device manufacturer testing

Photo by Christian Wiediger on Unsplash

Have you ever encountered bugs or crashes that only occur on specific devices or OS versions? I’m sure we all have. It’s a good idea to test your app across different OS versions and device manufacturers. We do a smoke test on all target OS versions and pick the top 10 device models based on user adoption, paying extra attention to “problematic” device models that usually produce crashes. This practice has caught big crashes before a production release. It becomes more manageable once you can afford a device farm. Firebase Test Lab has a free but limited device farm; you could start from there. There are also many paid cloud-based device farms nowadays at very reasonable prices.
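Choosing that smoke-test matrix is just a ranking over your analytics data. A minimal sketch, assuming you can export session counts per device model (the data shape is hypothetical):

```kotlin
// Pick the smoke-test device matrix: top N models by user adoption,
// from analytics session counts keyed by device model.
fun topDeviceModels(sessionsByModel: Map<String, Long>, n: Int = 10): List<String> =
    sessionsByModel.entries
        .sortedByDescending { it.value }
        .take(n)
        .map { it.key }
```

Re-running this each quarter keeps the matrix honest as your user base shifts, and known-problematic models can be pinned into the list even when they fall out of the top 10.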

Achieve it clearly through OKR

Photo by Startaê Team on Unsplash

Lastly, your entire organization should be aware of this objective and support your team in achieving it. One of the standard tools to align team goals is an OKR. You can read more about OKRs here. Why is this important? Because to reach a 99.9x% crash-free rate, you need help and support from others. When it’s on the OKR, others take it seriously. They will notice you and your objective and start to help you, which smooths out your bumpy road. Your OKR should be progressive as well: set a minimum at first, like 99% crash-free in Q1, then raise it to 99.3% in Q2, and so on until you maintain it at the highest level possible.


A Tech Enthusiast | Software Engineering Manager