Postmortem: Web Application Outage Incident

Bereket Assefa
Aug 13, 2023


Issue Summary:

Duration: August 10, 2023, 14:00 (UTC) to August 11, 2023, 03:30 (UTC)

Impact: The web application experienced a widespread outage for approximately 13.5 hours, affecting roughly 60% of users. Affected users reported slow loading times, timeouts, and an inability to access key features.

Timeline:

  • 14:00: Issue detected by automated monitoring system as server response times exceeded acceptable thresholds.
  • 14:15: Engineering team alerted via SMS and email notifications.
  • 14:30: Initial investigation initiated, focusing on database performance and server health.
  • 15:45: Database overload assumed to be the root cause, based on a recent traffic surge.
  • 16:30: Database scaling performed, but no improvement observed in the application’s performance.
  • 17:15: Escalated incident to senior engineering team for further assistance.
  • 18:00: Deep-dive analysis revealed high CPU utilization on application servers.
  • 19:30: Investigation was misdirected toward a suspected DDoS attack.
  • 21:00: Security team confirmed no signs of malicious activity; attention shifted back to internal issues.
  • 22:45: Realized a memory leak in a recently deployed microservice might be causing the slowdown.
  • 00:30 (August 11): Incident escalated to the DevOps team for immediate collaboration.
  • 02:15: Memory leak identified and patched in the microservice code.
  • 03:00: Application performance gradually improved after patch.
  • 03:30: Full recovery achieved; services resumed normal operation.

Root Cause and Resolution:

Root Cause: The issue was caused by a memory leak in a newly deployed microservice. As leaked objects accumulated, the application's performance degraded gradually, most likely through mounting garbage-collection and allocation overhead, which showed up as increased CPU utilization and slow response times.

Resolution: The memory leak was traced to the microservice code responsible for handling user session data, which retained session state after it was no longer needed. The code was refactored to properly release memory once each session ended, a patch was deployed, and follow-up testing confirmed that the leak was resolved.
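
For illustration, here is a minimal sketch of the class of bug involved and the shape of the fix, assuming a Python microservice that keeps session data in an in-process dictionary. The function names and data structures below are hypothetical and are not the actual production code.

    import time

    # Hypothetical in-process session store. Before the patch, entries were
    # added on every request but never removed, so memory grew for the life
    # of the process.
    _sessions = {}

    def handle_request_leaky(session_id, payload):
        # Pre-patch behavior: session data accumulates indefinitely.
        _sessions[session_id] = {"payload": payload, "last_seen": time.time()}

    SESSION_TTL_SECONDS = 30 * 60  # assumed 30-minute session lifetime

    def handle_request_fixed(session_id, payload):
        # Post-patch behavior: expired sessions are released on each request,
        # so the store stays bounded.
        now = time.time()
        _sessions[session_id] = {"payload": payload, "last_seen": now}
        expired = [sid for sid, entry in _sessions.items()
                   if now - entry["last_seen"] > SESSION_TTL_SECONDS]
        for sid in expired:
            del _sessions[sid]

The key property of the fix is that memory use becomes proportional to the number of active sessions rather than to the total number of sessions ever seen.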

Corrective and Preventative Measures:

Improvements/Fixes:

  • Implement stricter code review processes for new code deployments to catch memory leaks and other performance-related issues.
  • Enhance monitoring to include detailed memory-usage metrics for all microservices (a sketch of one way to export such metrics follows this list).
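
One possible implementation of the monitoring item above, assuming a Python service and the psutil and prometheus_client libraries (both are assumptions; any equivalent metrics stack would work), is a small exporter that publishes the process's resident memory for the monitoring system to scrape:

    import time

    import psutil
    from prometheus_client import Gauge, start_http_server

    # Hypothetical metric name; the monitoring system scrapes it from /metrics.
    RSS_BYTES = Gauge("service_memory_rss_bytes",
                      "Resident memory of the microservice process in bytes")

    def export_memory_metrics(port=8000, interval_seconds=15):
        start_http_server(port)      # expose /metrics on the given port
        proc = psutil.Process()      # the current service process
        while True:
            RSS_BYTES.set(proc.memory_info().rss)
            time.sleep(interval_seconds)

    if __name__ == "__main__":
        export_memory_metrics()

An alert on sustained growth of this metric would have surfaced the leak long before CPU saturation did.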

Tasks to Address the Issue:

  1. Conduct a post-incident review meeting to discuss the incident response process and identify areas for improvement.
  2. Review and update the incident escalation process to ensure timely involvement of appropriate teams.
  3. Develop and implement automated tests to detect memory leaks in the continuous integration and deployment pipeline (an example test follows this list).
  4. Establish a clear documentation process for new code deployments, including code changes, dependencies, and potential impact on system resources.
  5. Conduct a thorough review of the application’s architecture to identify potential areas of optimization to prevent similar incidents in the future.
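
A sketch of the automated leak check from item 3, assuming a pytest-style test run in CI and Python's built-in tracemalloc module; the module path and handler name are placeholders for the real session-handling code:

    import tracemalloc

    def test_session_handler_does_not_leak():
        # Hypothetical import path for the session-handling code under test.
        from myservice.sessions import handle_request

        tracemalloc.start()
        baseline = tracemalloc.take_snapshot()

        # Exercise the handler many times with a bounded set of session ids.
        for i in range(10_000):
            handle_request(session_id=f"s{i % 100}", payload={"n": i})

        current = tracemalloc.take_snapshot()
        growth = sum(stat.size_diff
                     for stat in current.compare_to(baseline, "lineno"))
        tracemalloc.stop()

        # Allow some noise, but fail the build if retained memory keeps growing.
        assert growth < 1_000_000, f"possible memory leak: {growth} bytes retained"

The exact memory budget would need tuning per service, but a check like this turns a leak into a build failure instead of a production incident.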

This incident highlighted the critical importance of thorough code review, rigorous testing, and proactive monitoring in maintaining the reliability and performance of our web application. By implementing the recommended corrective and preventative measures, we aim to minimize the risk of similar issues and ensure a seamless user experience.

In conclusion, the outage was caused by a memory leak in a new microservice, resulting in slow application performance. Through collaborative efforts and targeted investigation, the issue was identified and resolved, paving the way for a more resilient application architecture moving forward.
