INA Digital Edu

Building technologies to create irreversible transformation in improving Indonesia’s education system.

Incidents Overview: What has happened in 2 years?

12 min read · Dec 11, 2024


Writers: Nadinastiti, Pertiwi Sapta Rini, Sentanu Eddy

Since the implementation of incident management 2 years ago, GovTech Edu has faced 165 technical incidents, 26 of which were triggered by third-party issues. This article explores the valuable lessons we have gained from navigating these challenges.

To better understand and learn from these incidents, we began by categorizing their root causes and also identifying their triggers. These initial categories were developed within the first few months of implementation and have been expanded over time to accommodate previously unclassified incidents as they arose.

Our approach to incident management has yielded significant improvements in infrastructure monitoring and alerting, capacity planning, testing processes, infrastructure requirements, and product metrics. These improvements have reduced downtime, enhanced scalability, strengthened security, and streamlined deployment practices, while fostering a culture of collaboration, documentation, and continuous learning within GovTech Edu.

Incident Root Causes

To understand why an incident happened, we start by analyzing its underlying causes. We use the "5 Whys" framework to help us find the real root cause of an incident. One example:

  • Why did this incident happen? An engineer needed to update data by injecting a query directly into the database, but the executed query applied the wrong filter and was not wrapped in a transaction, causing it to run excessively long and update unintended data.
  • Why did the engineer need to update data via direct database injection? Because duplicated data existed in a single list.
  • Why did the executed query use the wrong filter and run without a transaction? Because there was no proper end-to-end engineering workflow for updating data in production, including essential practices such as review or pairing with other engineers.

Even though we did not literally ask "why" five times, this analysis revealed that the root cause of this incident was an insufficient engineering process: the process did not include code review or pairing when updating data in production. We also use other root cause analysis techniques, as long as the real root cause is identified and accepted by the team.
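For illustration, here is a minimal Go sketch of the safer pattern the improved workflow would enforce: wrap the manual fix in a transaction, bound its runtime, and verify the number of affected rows before committing. The table, filter, and function names are hypothetical; the article does not show the actual query.

```go
package datafix

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// DeduplicateListItems is a hypothetical manual data fix. The point is the
// pattern, not the query: run the fix inside a transaction, bound its
// runtime, and verify the number of affected rows before committing.
// The caller opens *sql.DB with whatever driver is in use.
func DeduplicateListItems(ctx context.Context, db *sql.DB, listID, expectedRows int64) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second) // bound the manual query's runtime
	defer cancel()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	// The WHERE clause is written and reviewed with another engineer
	// before execution, per the improved workflow.
	res, err := tx.ExecContext(ctx,
		`UPDATE list_items SET is_duplicate = TRUE
		 WHERE list_id = $1 AND is_duplicate = FALSE AND source = 'import'`, listID)
	if err != nil {
		return err
	}

	affected, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if affected != expectedRows {
		return fmt.Errorf("expected %d rows, query touched %d: rolling back", expectedRows, affected)
	}
	return tx.Commit()
}
```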

To streamline and standardize the process of identifying root causes, we needed a root cause category mapping. We adopted the three key elements of successful organizational transformation — people, process, and technology — as a starting point for the categorization. Using this framework, we identified and grouped all root causes from the initial months into well-defined categories. The root cause category mapping we have used over the past 2 years is presented below.

Incident Triggers

In addition to conducting root cause analysis, it is essential to identify when and where each incident first occurred. For instance, a significant number of incidents were triggered by incorrect changes made in production. Understanding this helps us gain deeper insights into the Change Failure Rate for each development team/tribe and pinpoint areas for improvement in our development cycle. Below, we have outlined the triggers of incidents that have occurred at GovTech Edu.

Post-Incident Improvements

In line with our commitment to user-centricity, the ultimate goal of understanding the incidents and their triggers is to learn from them and improve our value stream, ensuring we deliver the best possible services to our users — teachers, lecturers, headmasters, school/university administrators. After a few months of implementing our proposed incident management workflow, we have been able to pinpoint areas requiring organizational improvements.

To structure these improvements, we have adopted the concept of the Technology Value Stream from the DevOps domain. This framework helps us group the necessary enhancements and better align our efforts across the organization. The first step involves identifying all key participants in the value stream. According to Gene Kim, author of The DevOps Handbook, these roles typically include:

  1. Product Owner: Represents the internal voice of the business, defining the next set of functionalities for the service.
  2. Development: Builds application functionality for the service.
  3. Operations: Maintains the production environment and ensures required service levels are met.
  4. InfoSec: Ensures systems and data are secure.
  5. Quality Assurance (QA): Establishes feedback loops to ensure the service operates as intended.
  6. Release Managers: Manage and coordinate production deployment and release processes.
  7. Technology Executives or Value Stream Managers: Ensure the value stream meets or exceeds customer and organizational requirements from start to finish.

At GovTech Edu, these roles are adapted to our unique context. For instance, Development, Operations, QA, and InfoSec collaborate closely rather than working in separate sequential phases. Additionally, the responsibilities of the Release Manager are often handled by Product Engineers under the supervision of engineering leaders and with support from the Cloud Operations team.

Our proposed general Technology Value Stream, designed to improve incident management preventively and curatively, is depicted in the following diagram.

Technology Value Stream in GovTech Edu

For greater clarity, we further identified the specific streams associated with each member of the value stream. For example, the Release Manager is involved in multiple streams because we distinguish between deployment and code review activities. Some streams may include several members simultaneously, such as post-release, migration, and external/third-party implementations. A more detailed mapping is provided in the table below.

This detailed mapping allows us to scale collaboration more effectively across the organization. The following diagram illustrates the GovTech Edu Value Stream from the perspective of the stream groups.

Technology Value Stream in GovTech Edu from the perspective of the stream groups

Lessons learned over almost 2 years

We have been implementing the incident management workflow, executing the value stream described above, since September 2022. This has given us much better observability, enabling more effective incident handling. Below are the key lessons we have gathered from incidents that occurred between Q4 2022 and Q3 2024.

Statistics

In two years, 165 incidents happened across our organization; 38.18% of them were classified as SEV-3. Our busiest quarter was Q1 2023.

Improvement priorities from incidents

Excluding external causes, our top 5 priorities for improvement are:

  1. Infrastructure monitoring and alerts
  2. Infrastructure capacity planning
  3. Testing
  4. Infrastructure requirements
  5. Product monitoring and alerts

Improvements in infrastructure monitoring and alerts

Based on our data, around 42 incidents, or 25.45% of all incidents, have action items related to improving infrastructure monitoring and alerts. This shows that many incidents could have been prevented by better monitoring and alerting, especially for infrastructure.

The improvements needed in infrastructure monitoring and alerts mostly revolve around the following aspects:

Configuration

Our essential monitoring uses Grafana, and we configure alerts to be sent to specialized alerting Slack channels and to OpsGenie. Some incidents could have been prevented if the alerts had been configured properly, both in specialized Slack channels (to prevent alert fatigue) and in OpsGenie (so the on-call gets an automatic phone call if an alert is not acknowledged).

Threshold

Some incidents happened because the alert threshold was not set properly (either too low or too high), which can also cause alert fatigue. To improve, teams need to evaluate monitoring and alerts periodically, especially when business requirements change. Layered alerts are also an alternative for prioritized monitoring that increases on-call awareness: for example, a first alert fires when a metric reaches 80% and a second fires when it reaches 90%.
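In Grafana this layering would typically be two alert rules on the same query with different thresholds and severities. Purely as an illustration of the idea, a minimal sketch in Go, using the example values from the text:

```go
package alerting

// Severity buckets for a layered alert on a single utilization metric.
type Severity string

const (
	SeverityNone     Severity = "none"
	SeverityWarning  Severity = "warning"  // early heads-up, e.g. Slack only
	SeverityCritical Severity = "critical" // pages the on-call, e.g. via OpsGenie
)

// Classify maps a utilization ratio to a severity using the layered
// thresholds from the text: warn at 80%, escalate at 90%. The thresholds
// are the article's example values, not a recommendation for every metric.
func Classify(utilization float64) Severity {
	switch {
	case utilization >= 0.90:
		return SeverityCritical
	case utilization >= 0.80:
		return SeverityWarning
	default:
		return SeverityNone
	}
}
```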

Coverage

Many incidents need improved monitoring coverage, for both basic and specialized monitoring, to prevent them from recurring. Our Infrastructure team has already written guidelines for basic monitoring as part of our architecture review cadence. The basic alerts should cover at least these cases:

  1. No traffic
  2. Pod restart
  3. Memory utilization (usage vs request)
  4. CPU utilization (usage vs request)
  5. Database utilization (usage vs request)
  6. In-memory database utilization (usage vs request)
  7. 5xx requests

In addition, logs need to be contextual and help engineers troubleshoot. Besides the basic monitoring and alerts above, we also took note of metrics that show up in multiple incidents, such as database connections and queues. Teams might also need to add more specialized metrics for better accuracy, such as the system memory usage ratio.
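For the "5xx requests" item above, here is a minimal sketch of how such a metric could be exposed, assuming a Go service instrumented with the Prometheus client_golang library (the metric and label names are hypothetical). Keeping the route label bound to the registered pattern rather than the raw URL also avoids the unbounded-metrics problem discussed later in this article.

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// httpResponses counts responses by route and status class so a "5xx requests"
// alert has something to fire on. The route label uses the registered route
// pattern, not the raw URL, to keep the metric's cardinality bounded.
var httpResponses = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_responses_total",
		Help: "HTTP responses by route and status class.",
	},
	[]string{"route", "status_class"},
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Instrument wraps a handler and records one counter increment per response.
func Instrument(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)

		class := "2xx"
		switch {
		case rec.status >= 500:
			class = "5xx"
		case rec.status >= 400:
			class = "4xx"
		case rec.status >= 300:
			class = "3xx"
		}
		httpResponses.WithLabelValues(route, class).Inc()
	})
}
```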

Improvements in infrastructure capacity planning

Around 27 incidents, or 16.36%, needed better infrastructure capacity planning. These incidents are usually also caused by external events that bring higher traffic, as many users access the platform simultaneously. Our learnings:

Preparation is key

Besides proper communication with the relevant stakeholders about the schedule of external events, other factors inevitably come into play, so it is important for us to prepare better.

How do we prepare better? Use data-driven estimation for infrastructure provisioning. In addition, over-provisioning infrastructure is preferable when high traffic is expected, to leave headroom for error (for business as usual, 3 pods per service are considered safe, so we may need to provision more for events). Extra monitoring and alerting in a dedicated war room is also a good initiative for involving all related internal stakeholders.
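As a rough illustration of what data-driven estimation can look like, here is a back-of-the-envelope sketch; all inputs are hypothetical, and real numbers would come from load tests and historical traffic.

```go
package capacity

import "math"

// EstimatePods is a back-of-the-envelope, data-driven estimate of how many
// pods to provision before a known event.
//
//   peakRPS   expected peak requests per second during the event
//   perPodRPS sustainable requests per second per pod (from load tests)
//   headroom  multiplier for error headroom, e.g. 1.5 for 50% extra
//   minPods   floor, e.g. 3 pods per service for business as usual
func EstimatePods(peakRPS, perPodRPS, headroom float64, minPods int) int {
	pods := int(math.Ceil(peakRPS * headroom / perPodRPS))
	if pods < minPods {
		return minPods
	}
	return pods
}

// Example: 1,200 RPS expected, 150 RPS sustained per pod, 50% headroom,
// and a BAU floor of 3 pods -> EstimatePods(1200, 150, 1.5, 3) returns 12.
```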

The team has also written guidelines on preparing for live events, covering which critical user journeys must be ready for high traffic, who to communicate with, and how to monitor in the war room.

Load test

Besides the usual preparation, we should also check feature readiness for high traffic, especially for newly released features. Make sure features are well prepared by running load tests and/or stress tests beforehand.
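Dedicated tools (k6, Locust, Vegeta, and the like) are the usual choice for this. Purely to illustrate the shape of such a pre-release check, a minimal Go load generator might look like the sketch below; the target URL, worker count, and duration are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

// A minimal concurrent load generator: `workers` goroutines hit `target`
// for `duration` and report throughput and error counts. Real load tests
// are better done with a dedicated tool; this only sketches the idea.
func main() {
	const (
		target   = "https://staging.example.com/healthz" // hypothetical endpoint
		workers  = 50
		duration = 30 * time.Second
	)

	var total, errors int64
	deadline := time.Now().Add(duration)
	client := &http.Client{Timeout: 5 * time.Second}

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := client.Get(target)
				atomic.AddInt64(&total, 1)
				if err != nil || resp.StatusCode >= 500 {
					atomic.AddInt64(&errors, 1)
				}
				if err == nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()

	fmt.Printf("requests=%d errors=%d rps=%.1f\n",
		total, errors, float64(total)/duration.Seconds())
}
```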

Improvements on testing

Out of 165 incidents, 19, or 11.52%, need improvements in testing. This includes the testing process and coverage.

Process

There are at least three aspects of the testing process that need to be improved. First, the availability of staging environments. Some parts of the platform were not available in staging, so existing features could not be tested properly; there was no way to simulate in staging the errors that caused the incident.

Second, collaboration between product engineers and QA. Pair testing, where developers and QA engineers collaborate to identify potential issues, can greatly enhance the quality of our platforms and prevent incidents.

Third, some teams skipped testing intentionally, for example because they were too hasty in deciding to implement changes. Small details can cause incidents, so we had better be prepared for every case.

Coverage

It is also necessary to consider test coverage. Some of the cases we found that need more testing are:

  • Unhandled responses
  • Data integrity
  • Negative test cases
  • Lack of unit tests
  • Any fundamental changes
  • New features that are expected to be accessed by end users (public)
  • Events that are expected to receive much more traffic than usual, such as registration and submission deadlines
  • Deployments or upgrades of a service (these might need integration tests)

Improvements in infrastructure requirements

Out of 165 incidents, 19, or 11.52%, need improvements in infrastructure requirements. Our lessons learned:

Configuration and Preparation: The Cornerstones of Stability

At the heart of our learnings lies the critical importance of meticulous configuration and thorough preparation. We’ve discovered that careful setup and planning can preemptively address many potential issues.

When dealing with Network Address Translation (NAT) and unbounded metrics, we’ve learned that caution is paramount. Understanding the implications of these elements on system performance and scalability has helped us prevent unforeseen bottlenecks. Similarly, we’ve found that properly configuring staging environments to mirror production settings closely is invaluable in catching issues before they impact our users.

Our experience with Google Kubernetes Engine (GKE) backup and restore operations has underscored the importance of infrastructure matching. We now ensure that the destination cluster’s setup, including node pools and firewall rules, aligns with the source to prevent post-restoration complications.

In low-traffic environments, we’ve found that adjusting Prometheus pod eviction settings can prevent unnecessary scaling issues. We’ve also learned to leverage gcloud configurations effectively, setting default variables like max_worker to safe values before running any services.

Our approach to metric handling has evolved, with a focus on avoiding dynamic endpoints in high-traffic Prometheus metrics. Instead, we’ve explored creating new metrics with Google Cloud Load Balancer for dynamic endpoints, enhancing our observability without compromising performance.

Failover Systems: The Unsung Heroes of Reliability

GovTech Edu’s journey has reinforced the critical role of robust failover systems in maintaining service continuity. We have invested significant effort in ensuring that these systems operate seamlessly, minimizing direct impact on users during incidents.

Our experience with Workload Identity has taught us to approach new technologies cautiously, particularly noting the instability of the STS method for GitLab CI. This awareness has led to more robust CI/CD pipelines and deployment strategies.

The optimization of emissary loads, especially in production environments, has emerged as a key focus area. By considering the separation of emissaries, we have found ways to reduce the burden on centralized systems, improving overall performance and reliability.

Drift checks have become an integral part of our infrastructure management strategy. These checks help detect and verify intentional changes, preventing configuration drift that could lead to unexpected behavior.
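The article does not say which tool backs these drift checks. Assuming Terraform-managed infrastructure, one possible way to automate the check is to run terraform plan with -detailed-exitcode and treat exit code 2 as drift, as in the sketch below.

```go
package driftcheck

import (
	"fmt"
	"os/exec"
)

// HasDrift runs `terraform plan -detailed-exitcode` in dir and reports whether
// the live infrastructure differs from the committed configuration. This
// assumes Terraform-managed infrastructure, which the article does not state
// explicitly; treat it as one possible way to automate a drift check.
//
// With -detailed-exitcode, terraform exits 0 when there are no changes,
// 2 when there is a diff, and 1 on error.
func HasDrift(dir string) (bool, error) {
	cmd := exec.Command("terraform", "plan", "-detailed-exitcode", "-input=false")
	cmd.Dir = dir
	err := cmd.Run()
	if err == nil {
		return false, nil // exit code 0: no drift
	}
	if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() == 2 {
		return true, nil // exit code 2: plan found differences
	}
	return false, fmt.Errorf("terraform plan failed: %w", err)
}
```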

Lastly, we have fine-tuned our approach to Horizontal Pod Autoscaler (HPA) and load balancer configurations, recognizing their crucial role in maintaining system stability under varying loads. Coupled with ongoing efforts to monitor and optimize database performance, these measures form a comprehensive strategy for ensuring system reliability.

Improvements in product metrics and alerts

To address the need for improvement in product metrics and alerts, particularly given that 11.52% of total incidents fall into this category, it’s essential to derive actionable lessons from these incidents. Here are some key takeaways that can enhance product metrics and alerting systems:

Fostering a Culture of Collaboration

At the heart of GovTech Edu’s evolution lies a renewed focus on collaboration and documentation. We’ve recognized the critical importance of maintaining up-to-date documentation for cross-squad and cross-tribe collaboration. This emphasis on clear communication extends to code changes involving cross-teams, ensuring that all stakeholders are aware of the changes and their potential impacts.

Alert Monitoring

Alert monitoring has emerged as a cornerstone of our incident prevention strategy. We’ve significantly enhanced our alerting systems, implementing tools like Sentry to detect issues early. This proactive approach extends to both production and pre-release environments, with specific alerts set up for third-party service monitoring. Regular reviews and optimizations of alert configurations and rate limits have become standard practice, ensuring that we remain responsive to changing system dynamics.

Deployment Best Practices

Perhaps most notably, we’ve revolutionized our deployment best practices. We ensure the correct configuration of Helm secrets and Ambassador mappings before any deployment, significantly reducing the risk of configuration-related incidents. Thorough testing in development environments has become mandatory before pushing to production, and the organization has wisely chosen to avoid deploying major changes on Fridays, acknowledging the increased risk associated with weekend deployments. Well, we sometimes deploy on Fridays if it is an emergency ;). Performance optimization techniques, such as context cancellation in goroutines, have been implemented to address resource utilization issues.
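As a generic sketch of the context-cancellation point above (the handler, query, and timeout are hypothetical, not our actual code), the pattern is to derive a bounded context from the request and pass it down to the database call, so abandoned or slow requests stop consuming resources instead of leaking goroutines and connections.

```go
package handler

import (
	"context"
	"database/sql"
	"fmt"
	"net/http"
	"time"
)

// GetPendingCount illustrates context cancellation in goroutines: the request
// context is bounded with a timeout and passed down to the database call, so
// when the client disconnects or the deadline passes, the query and the
// goroutine serving the request stop instead of holding connections open.
func GetPendingCount(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
		defer cancel() // releases the timer even if the query finishes early

		var count int64
		err := db.QueryRowContext(ctx,
			`SELECT COUNT(*) FROM submissions WHERE status = 'pending'`).Scan(&count)
		if err != nil {
			// context.Canceled and context.DeadlineExceeded also land here.
			http.Error(w, "report unavailable", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintf(w, `{"pending": %d}`, count)
	}
}
```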

Conclusions — Invest in Observability and Proactive Management

As we reflect on our journey at GovTech Edu, it’s clear that investing in observability and proactive management has been transformative for us. The high percentage of incidents related to monitoring, alerts, and infrastructure issues highlighted the critical need for robust observability practices. This realization has driven several key initiatives that have reshaped our approach to system reliability.

First and foremost, we’ve embraced a holistic approach to observability, recognizing that both product and infrastructure domains require significant improvements in monitoring and alerting systems. This comprehensive strategy has allowed us to implement end-to-end solutions that enhance our ability to detect and address potential issues before they escalate into major incidents. By doing so, we’ve reduced downtime and improved overall system reliability.

Proactive incident prevention has become a cornerstone of our strategy. By enhancing our monitoring capabilities, we can now identify issues early and take corrective actions swiftly. This proactive stance not only minimizes disruptions but also empowers us to make data-driven decisions. Improved observability provides valuable insights for capacity planning, performance optimization, and future improvements, enabling more informed decision-making processes.

Cross-functional collaboration has been another significant outcome of our enhanced observability tools. These tools facilitate better communication between development, operations, and other teams, leading to more efficient problem-solving and a shared understanding of system health. This collaborative environment fosters a culture of continuous improvement within GovTech Edu.

Our focus on scalability and performance has positioned us to handle traffic spikes effectively and ensure consistent performance during high-demand periods. Additionally, improvements in rate limiting, DDoS prevention, and overall system security have enhanced our platform’s resilience against external threats.

By prioritizing investments in observability across both product and infrastructure domains, we’ve significantly enhanced our ability to deliver reliable, high-performance services to our users. Combining those things with proactive management and continuous improvement sets a strong foundation for future growth and innovation in our organization. As we move forward, these lessons will continue to guide us in our mission to provide exceptional technology solutions in the education sector.
