Lessons Learned from Asesmen Nasional: Serving ~80k rps Efficiently on the Cloud

Muhammad Saiful Islam
GovTech Edu

--

Writers: Muhammad Saiful Islam, Alifia, and Nadinastiti

Asesmen Nasional is an annual education evaluation program by the Ministry of Education, Culture, Research, and Technology that aims to improve the quality of education by evaluating student learning outcomes, teaching quality, and learning environment.

It was introduced as part of the Merdeka Belajar Episode 1 policy package in December 2019 and has been conducted since 2021. The program utilizes various instruments, such as Asesmen Kompetensi Minimum (AKM), Survei Karakter, and Survei Lingkungan Belajar, to provide crucial data for schools and local governments to overcome challenges and enhance education through the Rapor Pendidikan platform.

Students participate in a two-day online or semi-online assessment, with the majority opting for the former. To support the large number of students participating in real time, the ministry chose Google Cloud to host the system because of its scalability and pay-as-you-go model.

Picture this: it’s October 23, 2023, around 7:30 AM Western Indonesia Time, when most Indonesian students participating in Asesmen Nasional started their assessment session. The first browser tab we opened was the monitoring dashboard, showing traffic reaching 20,000 requests per second (rps). Those requests came from students in the Central and Eastern Indonesia Time provinces, whose sessions had started earlier according to their local time.

That was a special day for all of the technical parties involved, together with our counterparts from the ministry. A lot of preparation was done, as that was the first day of the assessment session for our elementary school students — the largest population of the nationwide assessment.

The traffic peaked at 8:24 AM, with 78,430 rps being processed by the external load balancer. At peak, we recorded around half a million concurrent users, and half of the traffic consisted of writes to our databases. We kept looking at the p95 latency panel of the “web-siswa” component that students accessed. Latency was where we had put in the most work throughout the year.

At peak traffic, the system responded to almost all requests within ~30 ms; latency only rose briefly, staying under 60 ms, when the second session began at 11:00 AM local time. That day, we served more than 1 million students participating in the assessment, with more than 50,000 proctors online.

Moreover, the error rate stayed below 0.08% of all traffic.

The statistics above may be business as usual for those who are used to handling large-scale, high-traffic web apps. However, we couldn’t hold back our tears during the session, remembering the continuous journey of performance tests, investigations, and optimizations of a system that used to have average latency measured in seconds instead of milliseconds and a single-digit percent error rate instead of 0.08%.

We want to take this opportunity to reflect on our journey and share what we have learned to ensure the success of Asesmen Nasional, both technically and non-technically.

Establish a way of working

On receiving the assignment from the ministry, we discovered these parties would be involved in running Asesmen Nasional that year:

  • Pusat Asesmen Pendidikan (Pusmendik, or the Center for Education Assessment), which owns and develops the whole Asesmen Nasional program;
  • Pusat Data dan Informasi (Pusdatin, or the Center for Data and Information), which manages the IT infrastructure for the ministry, including the cloud;
  • Radya Labs, a digital agency chosen by Pusmendik to support the scalability improvements of the assessment system for the “online” mode, utilizing cloud technologies;
  • GovTech Edu, which already supports the ministry under the Pusdatin coordination to build and run technology-based ecosystems. In this case, the ministry considered our experience running the ministry platforms on top of Google Cloud relevant to supporting Asesmen Nasional.

Given the number of parties and the technical challenges involved, we assembled a task force consisting of the Cloud Platform, Core QA, Software Development Engineers in Test (SDET), and Technical Program Management (TPM) teams. A combination of engineers and program managers ensured the task force always had a holistic view when solving problems.

With so many stakeholders in this nationwide program, establishing a communication plan was a must to ensure that all of them stayed aligned with the objectives and progress of the initiative.

We also established a written way of working and a timeline with Pusmendik to ensure that our scope of work was clear and distinct from the other parties’. The timeline was maintained in a Google Sheets document and served as the source of truth for all parties involved.

The timeline view.

Architecture review

We rolled up our sleeves with an architecture review at the beginning of the year, together with Pusmendik and Radya Labs.

The platform used to conduct Asesmen Nasional has been around since the first assessment in 2021. It was built using the .NET Framework and ran on top of Microsoft stacks like Windows Server and SQL Server. However, running the infrastructure was costly because of the licensing fees, so the platform was converted to run on Linux-based containers, albeit on a static number of Ubuntu-based VMs without a container orchestrator.

Based on that background, we planned to use Google Kubernetes Engine (GKE) to run the platform. The goal was to run the infrastructure more efficiently by configuring the cluster to scale up and down automatically as traffic comes in, combining the Horizontal Pod Autoscaler (HPA) and GKE cluster autoscaling features.
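
To give a sense of what that combination looks like, here is a minimal sketch of creating an HPA with the Kubernetes Python client. The Deployment name mirrors the “web-siswa” component mentioned later in this post, while the namespace, replica bounds, and CPU threshold are illustrative assumptions rather than the actual Asesmen Nasional configuration.

```python
# A minimal HPA sketch using the Kubernetes Python client. Names, namespace,
# and thresholds are illustrative assumptions, not the production values.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "web-siswa", "namespace": "assessment"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "web-siswa",
        },
        "minReplicas": 1,    # matches the always-on "minimum" node pool sizing
        "maxReplicas": 200,  # hypothetical upper bound
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 70},
            },
        }],
    },
}

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="assessment", body=hpa
)
```

When the HPA adds pods beyond what the current nodes can hold, the GKE cluster autoscaler provisions new nodes; when pods scale back down, underutilized nodes are removed, so capacity roughly follows the traffic.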

The architecture review sessions were valuable for integrating the expertise of the teams. Radya Labs was already familiar with all the use cases and the functional side, so we worked together to formulate the requirements for the performance and infrastructure side. These sessions highlighted the importance of the multidisciplinary composition of our task force, which was able to keep a holistic view in the discussions and ask critical questions about edge cases and scalability challenges.

Simplified architecture diagram of the Asesmen Nasional platform, running on Google Cloud.

Several key designs were discussed and decided:

  • We would utilize Google-managed services as much as possible instead of running the infrastructure components ourselves. This allowed us to focus more closely on the application side, while most of the work to keep the infrastructure highly available and reliable would be handled by Google Cloud.
  • The databases would run on Cloud SQL for PostgreSQL. Using an RDBMS allowed Radya Labs to keep using the paradigm familiar to Pusmendik as the program owner. Cloud SQL also offers the Query Insights feature, which allowed the team to identify less-performant queries at no additional cost.
  • However, based on the infrastructure scale in the previous years, it was clear that a single database instance could not keep up with all the write-heavy traffic to store student responses. Hence, the student response database would be sharded, and the application side would need to be modified to incorporate the sharding logic (a minimal routing sketch follows this list). (The authentication and assessment question databases were not sharded because their traffic is read-heavy rather than write-heavy.)
  • As the platform would utilize pod and cluster autoscaling, it was crucial to make the application side stateless, which was not yet the case.
  • We identified business logic that logs all of the students’ actions. This feature is vital for Pusmendik to handle issues and disputes regarding the assessment. However, it was clear that the log didn’t need to be processed and stored synchronously. The teams therefore introduced a message queue for the student action log using Cloud Pub/Sub (a publishing sketch also follows this list). This way, we could keep storing the student action log in a single database instance — no need to shard the logs, despite the heavy traffic coming in.
  • A combination of Cloud Storage and Cloud CDN was used to offload static asset traffic from the application. This allowed us to maximize the application’s resources to handle the most critical part: allowing students to fetch questions and respond to them.
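
To make the sharding decision concrete, here is a minimal sketch of the kind of routing logic involved, written in Python for brevity (the actual implementation lives in the .NET application). The shard count, connection strings, and hashing scheme are illustrative assumptions, not the production design.

```python
# Illustrative shard routing: map each student deterministically to one of the
# Cloud SQL instances holding the student response data. DSNs are placeholders.
import hashlib

SHARD_DSNS = [
    "postgresql://app@10.0.0.11/responses_shard_0",
    "postgresql://app@10.0.0.12/responses_shard_1",
    "postgresql://app@10.0.0.13/responses_shard_2",
    # ... one entry per Cloud SQL shard instance
]

def shard_for(student_id: str) -> str:
    """Return the DSN of the shard that stores this student's responses."""
    digest = hashlib.sha256(student_id.encode("utf-8")).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]
```

Similarly, here is a rough sketch of publishing a student action to Cloud Pub/Sub so it can be written to the log database asynchronously; the project, topic, and payload fields are hypothetical.

```python
# Illustrative asynchronous action logging via Cloud Pub/Sub.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "student-action-log")

def log_action(student_id: str, action: str) -> None:
    payload = json.dumps({"student_id": student_id, "action": action}).encode("utf-8")
    # publish() returns a future; the request path does not wait for the log write
    publisher.publish(topic_path, data=payload)
```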

Ensure good system observability

One of the principles we apply to all MoECRT products developed by GovTech Edu is to ensure they have good observability. This gave us a good understanding of how the system behaved.

For Asesmen Nasional, we observed the platform through metrics and logs (separate from the student action logs discussed earlier), using the built-in Cloud Monitoring and Cloud Logging stack. Most dependencies are managed by Google Cloud, so they automatically emit metrics and logs to our observability stack.

For the application side, we worked with Radya Labs to emit metrics in the Prometheus format, which Google Cloud Managed Service for Prometheus then ingested. Logs were also emitted from the application side and captured by Cloud Logging. Furthermore, the log format was suitable for Google Cloud’s Error Reporting feature, which identifies new errors as they show up in Cloud Logging, allowing us to track them and ensure they were addressed promptly.
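
As an illustration of the metrics side, here is a minimal sketch of exposing Prometheus-format metrics, written in Python with prometheus_client for brevity (the actual application is .NET); the metric names, labels, and port are assumptions.

```python
# Illustrative Prometheus instrumentation: a request counter and a latency
# histogram exposed over HTTP for Managed Service for Prometheus to scrape.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["path"])

def save_answer(request) -> int:
    return 200  # placeholder for the real application logic

def handle_submit_answer(request) -> int:
    with LATENCY.labels(path="/answer").time():
        status = save_answer(request)
    REQUESTS.labels(path="/answer", status=str(status)).inc()
    return status

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics on port 9090 for scraping
    while True:
        time.sleep(60)  # keep the process alive in this standalone sketch
```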

Ensuring observability from the start also helped us establish a baseline and iterate between performance tests.

Establish targets and prepare for performance tests

We also discussed with Pusmendik their requirements for the amount of traffic the platform needed to handle.

Pusmendik had established a traffic pattern based on the assessment sessions in 2021 and 2022. However, as the architecture would be different in 2023, we needed to run performance tests to ensure it could cope with the estimated traffic.

We initially estimated a peak traffic of 70,000 rps for the assessment sessions. However, Pusmendik needed to anticipate a worst-case scenario where all schools opted for the first available schedule during the rehearsal (gladi resik), so they established a soft target of 210,000 rps for the performance test instead, anticipating a total of 2,200,000 students rehearsing at the same time.

To support the soft target, the Core QA team at GovTech Edu created a performance test platform that generated up to 240,000 rps of traffic. Targeting this traffic level also gave us a 2x-3x safety margin for any potential intermittent spikes during peak load. The journey of building this platform is worth a dedicated blog post, which you can read more about in this article.

We then planned the performance tests to be conducted in multiple short sessions between May and July. Based on the performance test results, Radya Labs would keep improving the application side iteratively.

However, as the progress and results came in, it was clear that continuous improvement of the platform was needed even while we ran the assessment sessions. Hence, we worked with Pusmendik and Pusdatin to allocate 5-day performance tests each month from August to October. This way, Radya Labs could evaluate the actual metrics in production, develop the needed improvements, and have the implementation validated before deploying it to production.

Many parties, both technical and bureaucratic, were involved in the performance tests. The TPM team shouldered this challenge, jumping through hoops to ensure we could do the technical work.

Plan and conduct performance tests

Performance testing is more than just throwing as much traffic as possible at the service. From the quality team’s perspective, we acknowledged that executing the performance test for Asesmen Nasional required careful planning. Therefore, we created an elaborate yet flexible Test Plan document. The key aspects we planned for include:

  1. Appropriate performance test types were used: We opted for load, endurance, and spike tests. Load tests helped determine the system’s capacity to handle traffic. Endurance tests assessed the system’s performance over an extended period (approximately one hour). Finally, spike tests were specifically designed to evaluate login performance. We anticipated a significant increase in traffic during simultaneous student logins at the beginning of the assessment session. Each performance test required different configurations, which we learned and anticipated beforehand.
  2. Well-defined entry criteria: Given the involvement of multiple parties in performance testing, we wanted to ensure that the performance test is ready from the perspective of system configuration, test data, and scripts. With around a dozen individuals participating in the performance test session, unresolved issues could lead to postponement or a collective wait for fixes. Therefore, entry criteria must be consistently met before the performance test commences.
  3. Clear acceptance criteria: Establishing clear acceptance criteria involved extensive discussions with various stakeholders. The team needed clear acceptance criteria to determine whether the system was acceptable. The agreed-upon criteria include the system’s ability to handle 210,000 requests per second (rps) and the requirement that no student experiences blocking during the exam. The second criterion is hard to translate into quantified metrics such as error rate or latency, so we hypothesized that creating scenario scripts that closely resemble user behavior and conducting manual testing during the performance test would let us understand the user experience under heavy traffic and uncover challenging edge cases that are hard to identify under normal traffic conditions.
  4. Test scenario resembling actual users: Instead of using a simple test configuration that mindlessly floods several endpoints with a target amount of traffic, we modeled the test scenario to simulate many virtual users (VUs) doing what a student would do: accessing the home page, submitting their credentials on the login page, confirming their assessment session details, and reading and answering the questions in sequence. We also scripted a VU to stop accessing the next endpoint if it encountered system failures or unexpected responses in the previous steps (a minimal sketch of this scripting style follows the figure below). This paradigm makes a difference in the performance test report and provides insights into the system’s bottlenecks. In addition to standard metrics such as requests per second (rps), error rate, and latency, a new metric emerged that can enhance confidence in the system’s performance: the funnel rate.
An example of the funnel rate diagram generated from a performance test session. The concentration of users becomes more sparse as we go down the funnel because different users have varied waiting times between each step, broadening the standard deviation of the distribution and reducing the peak traffic. This diagram showcases the bottleneck on the login page due to having the most significant peak traffic, which Radya Labs then worked on.
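
GovTech Edu built its own load-generation platform for these tests (see the linked article). Purely to illustrate the user-journey scripting style described in point 4, here is a minimal sketch using Locust; the endpoint paths, credentials, question count, and wait times are hypothetical.

```python
# Illustrative user-journey scenario: each virtual user walks the same flow a
# student would, and stops early if any step fails.
from locust import HttpUser, task, between

class StudentUser(HttpUser):
    wait_time = between(5, 15)  # varied think time, which broadens the funnel downstream

    @task
    def assessment_journey(self):
        if self.client.get("/").status_code != 200:
            return  # stop the journey on failure, like a blocked student would
        login = self.client.post("/login", json={"username": "student-001", "password": "secret"})
        if login.status_code != 200:
            return
        if self.client.get("/session/confirm").status_code != 200:
            return
        for number in range(1, 31):  # read and answer the questions in sequence
            if self.client.get(f"/questions/{number}").status_code != 200:
                return
            answer = self.client.post(f"/questions/{number}/answer", json={"choice": "A"})
            if answer.status_code != 200:
                return
```

Running a scenario like this (for example, `locust -f scenario.py --headless -u 100000 -r 1000`) ramps virtual users up gradually, and the per-endpoint statistics then show where users drop out of the funnel.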

The monthly performance tests enabled us to focus on addressing the feedback after each test, knowing we would have another session the following month. Those tests helped us identify bottlenecks; we fixed them and repeated the test until all stakeholders were satisfied.

Premortem session

As the day drew near, we gathered all the stakeholders to do a premortem. A premortem is a program management strategy to improve a program by assuming the plan has failed and generating plausible reasons for the failure. First, we imagined the assessment session getting chaotic. Then, we analyzed why, worked out how we could mitigate the chaos, and assigned PICs.

This premortem exercise uncovered potential blind spots that might seem obvious to one team but remain obscure to another. For example, the Pusmendik operational team had taken measures to ensure that the schedules, sessions, and operational procedures were designed to prevent an overflow of participants. However, as a separate technical team, we had not been aware of those details. During the premortem session, it became clear to all participants that the current performance test target was sufficient and that there was no need for additional tests.

The diverse perspectives from all stakeholders significantly enriched the premortem. This method is highly recommended to prepare the team mentally for the impending event. Fortunately, none of the scenarios materialized during the assessment sessions. Nevertheless, our proactive preparation positioned us well to tackle unforeseen challenges.

Capacity planning and optimization

We worked with the application development team to estimate the infrastructure capacity needed based on the performance test results. We then communicated the estimated capacity regularly to our Google Cloud Technical Account Manager (TAM) to ensure we had the capacity needed to run the platform.

The performance tests involved only synthetic users, though, so as the system showed its real performance in production during the early assessment sessions, we fine-tuned the capacity and made a forecast to anticipate the next assessment sessions with more students.

Infrastructure resizing on schedule

As the platform runs on the public cloud with a pay-as-you-go model, we also saw an opportunity to save cost by downsizing the entire infrastructure during the hours with no assessment activity: 6 PM until 4 AM Western Indonesia Time. We decided to downsize instead of shutting down the entire infrastructure because Pusmendik still performed low-traffic activities, such as retrieving student logs as needed or configuring assessment session parameters to prepare for the next day.

The infrastructure resizing was done with shell scripts running on a VM, triggered by standard Linux cron. The scripts and the cron schedule are maintained using a set of Ansible playbooks. The application development team evaluated the infrastructure sizing daily, based on metrics and the forecasted traffic for the following day.

Several infrastructure components were downsized. First, the Kubernetes cluster, for which we provisioned two node pools:

  • One node pool is always on (we call this the “minimum” node pool) with a static number of nodes. We sized the node pool so the application deployments could run with 1 pod (the minimum number of pods configured on the pod autoscaler).
  • One node pool (the “maximum” node pool) gets autoscaled: the cluster autoscaler increases the number of nodes from 0 up to a maximum limit. The resize scripts modify this node pool by scaling it down to 0 nodes and deactivating its autoscaler in the evening, then reactivating it in the early morning.

Second, the Cloud SQL instances: we downsized their CPU and memory in the evening and upsized them back in the early morning.

Lastly, the Redis instances: we downsized their memory in the evening, after flushing all the data, and upsized it back in the early morning.
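
For a flavor of what the scheduled resize does, here is a minimal sketch driving the same kinds of operations through gcloud from Python; the project, zone, cluster, node pool, instance names, and target sizes are hypothetical, and the real implementation is a set of shell scripts managed with Ansible and triggered by cron.

```python
# Illustrative evening downsize: scale the "maximum" node pool to zero with its
# autoscaler disabled, and shrink a Cloud SQL instance to 2 vCPUs / 4 GB RAM.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def downsize_evening() -> None:
    run("gcloud", "container", "clusters", "update", "assessment-cluster",
        "--zone", "asia-southeast2-a", "--node-pool", "maximum",
        "--no-enable-autoscaling")
    run("gcloud", "container", "clusters", "resize", "assessment-cluster",
        "--zone", "asia-southeast2-a", "--node-pool", "maximum",
        "--num-nodes", "0", "--quiet")
    run("gcloud", "sql", "instances", "patch", "responses-shard-0",
        "--tier", "db-custom-2-4096")

if __name__ == "__main__":
    downsize_evening()  # the morning upsize mirrors this with the daytime sizes
```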

Always prepare for failures: One thing we kept in mind when implementing the infrastructure resizer was to always prepare for failures. We did this by:

  • Configuring a log-based alert on the Cloud Monitoring side for errors emitted by the infrastructure resizer. We then configured the alert to wake us up through Opsgenie.
  • Allocating buffer time to anticipate resize operation failures. As we needed the platform ready by 5:30 AM, we scheduled the upsize operations at 4 AM. This way, we had 1.5 hours to handle the situation if the operations failed.

Thankfully, those preparations were not in vain: we learned that the infrastructure resizing operations could indeed fail.

We encountered this case with the Cloud SQL instances: downsizing 20+ instances in the evening to 2 CPU cores and 4 GB of memory worked seamlessly, but sometimes we failed to upsize them back to their previous size in one shot. We didn’t expect this to happen initially, as our Google Cloud counterparts had made the necessary capacity reservations for this high-stakes ministry event. However, retrying the upsize operation would work.

Collaborating with our Google Cloud TAM revealed that the combination of many instances and their large sizes may cause upsize operations to fail occasionally when they are done in one shot. However, a particular mechanism on the cloud provider side causes the operation to succeed when it is retried.

We then added exponential backoff to the resize scripts, so they always retry the resize operations on all components when they encounter failures and alert us if the retries keep failing for 30 minutes. This way, we still have one hour to handle any failures, which is still acceptable.
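
Here is a minimal sketch of that retry behavior, assuming a `resize_fn` callable that performs one resize attempt and an `alert_on_call` hook standing in for the log-based alert and Opsgenie notification.

```python
# Retry a resize operation with exponential backoff; give up and page the
# on-call engineer if it still has not succeeded after 30 minutes.
import time

def alert_on_call(message: str) -> None:
    # Placeholder: in practice the scripts log the error, and a log-based
    # Cloud Monitoring alert notifies the on-call engineer via Opsgenie.
    print(message)

def resize_with_backoff(resize_fn, max_wait_seconds: int = 30 * 60) -> None:
    deadline = time.monotonic() + max_wait_seconds
    delay = 10.0  # seconds before the first retry
    while True:
        try:
            resize_fn()
            return
        except Exception as exc:
            if time.monotonic() + delay > deadline:
                alert_on_call(f"resize still failing after 30 minutes: {exc}")
                raise
            time.sleep(delay)
            delay = min(delay * 2, 300)  # double the wait, capped at 5 minutes
```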

Continuous improvements and canary deployments

Thanks to good system and application observability, we could monitor the performance and find issues as students did their assessments. Most of the issues could be fixed straightforwardly. However, some of them required us to experiment using canary deployments: we deployed changes to a single Kubernetes pod and tested it before rolling them out to all the pods.
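
As a rough illustration of that setup, the sketch below creates a one-replica canary Deployment with the Kubernetes Python client: the canary pod carries the candidate build but shares the Service’s selector label, so it receives a small slice of live traffic next to the stable pods. The names, namespace, labels, and image tag are hypothetical.

```python
# Illustrative single-pod canary: one replica of the candidate build, sharing
# the "app" label that the Service selects on.
from kubernetes import client, config

config.load_kube_config()

canary = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web-siswa-canary", "namespace": "assessment"},
    "spec": {
        "replicas": 1,  # exactly one canary pod
        "selector": {"matchLabels": {"app": "web-siswa", "track": "canary"}},
        "template": {
            "metadata": {
                # "app: web-siswa" matches the Service selector, so this pod
                # serves a fraction of real traffic alongside the stable pods
                "labels": {"app": "web-siswa", "track": "canary"},
            },
            "spec": {
                "containers": [{
                    "name": "web-siswa",
                    "image": "asia.gcr.io/my-project/web-siswa:candidate",
                }],
            },
        },
    },
}

client.AppsV1Api().create_namespaced_deployment(namespace="assessment", body=canary)
```

If the canary misbehaves, deleting this Deployment immediately takes it out of rotation; if it looks healthy, the same change is rolled out to the stable pods.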

Optimizing memory usage by profiling the canary deployments

One particular issue we faced was a resource and performance issue on the student application side: it consumed a large amount of memory (single-digit GB). Despite healthy CPU usage (around 65%), the p95 latency was measured in seconds instead of milliseconds.

Thanks to good observability, we had already eliminated slow database queries as much as possible: the application development team did many query rewrites, refactored several flows on the application side, and more. We started to assume that the application simply required a large amount of memory. Still, despite many efforts, we could not bring the p95 latency down.

One thing the application development team did was profile the canary deployment. This was done by running a .NET profiler inside the Kubernetes pod and inspecting the resulting profile. From the profile, we found that the logging library usage had not been tuned: it processed the logs in batches, and the library needed to be configured to limit the number of logs in its queue.

The application development team configured the library to limit the log queue and disabled several system logs. After canarying the change on a single Kubernetes pod, memory consumption and p95 latency decreased significantly. We rolled the change out to all the pods, confident it would work as expected.

The monitoring panel that brought us to tears — the p95 latency dropped from seconds to milliseconds after the application development team limited the log queue and disabled several system logs based on the canary deployment memory profile.
Moreover, the error rate also dropped significantly.

D-days

Our ways of working during the D-days, starting from Uji Kesetaraan in May, diverged slightly from the preparation phase. Here’s how we did it:

  1. Create a schedule for the monitoring room. We ran it over Zoom with Pusmendik and all the technical teams in the morning and evening.
  2. Create a centralized sheet to capture (a) an issues log and progress, (b) monitoring room discussions and action items, and (c) a traffic data recapitulation (which includes total participants, peak traffic, and other relevant metrics).

This structured approach ensured that any feedback from the event was acknowledged across team members, enabling us to find solutions faster. The centralized sheet facilitated issue tracking and served as a repository for meeting notes and critical traffic data. The traffic data collected in 2023 will be invaluable for predictive insights in the coming years.

Finally!

The system handled over 1 million users on a single day, with the traffic peaking at ~80,000 rps with ~30 ms p95 latency. We achieved it through a collaborative effort between the Pusmendik operational team’s meticulous scheduling and the technical team’s robust preparation for the massive traffic.

Overall, the Asesmen Nasional went relatively smoothly, marked by a notable absence of significant issues.

About the Team

Muhammad Saiful Islam is a Cloud Platform Architect at GovTech Edu. He has always been excited to help the ministry build and run tech-enabled products that scale reliably, as he believes in the impact of working for the public service in the long run.

Alifia serves as a Head of Quality Engineering at GovTech Edu. She believes the foundation of testing goes deeper than checking requirements. It has become her mission to empower SQAs to make continuous improvements and explore innovative testing methodologies because effective and context-driven testing not only ensures product quality but also fosters trust and confidence among stakeholders.

Nadinastiti is a Technical Program Manager at GovTech Edu. Having managed product development and infrastructure programs in previous companies, she leads the GovTech Edu task force to support Asesmen Nasional.

Does this scale of challenge interest you? Join us here: https://www.govtechedu.id/career!
