RCA for SYNQ dashboard login and registration outage on August 11th, 2017

Introduction:

This RCA addresses the outage that happened on August 11th, 2017. The following services were affected:

  • Login and registration to the SYNQ dashboard

Event Description:

On August 11th, 2017 at 17:13 UTC, we received a notification that one of our third-party hosting providers was having an issue. Since this provider hosts our login and registration services, we tested those services and found they were not working as expected. We posted on our StatusPage to let our clients know there was an issue. We had also seen similar issues earlier in the day, which development had resolved. The issue was immediately escalated to development for further investigation and resolution.

At 17:39 UTC, development identified the issue and began implementing a temporary fix.

By 18:05 UTC, the fix was in place and operations confirmed that it restored dashboard login and account registration. The issue was caused by new IPs from the third-party provider being blocked by our system. At 18:09 UTC, operations updated our StatusPage to let clients know the affected services were functioning as expected.

Since the fix was only temporary, development continued working on a permanent fix to prevent the issue from repeating.

By 19:32 UTC, development implemented a permanent fix so that dashboard login and site registration continue to work with dynamically changing IPs.

Root Cause and Remediation:

There were two issues. First, our system was susceptible to breakage when the Dashboard provider’s IPs changed. Second, these changes were occurring more frequently due to an infrastructure upgrade on the provider’s side.

To remediate this problem, development changed the IP blocking scheme to account for frequent IP changes on the Dashboard provider’s side.
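
This RCA does not describe how the blocking logic was changed. As an illustration only, one common way to tolerate frequently changing provider IPs is to match incoming addresses against refreshable network ranges instead of a fixed list of individual IPs. The sketch below assumes that approach; the ProviderAllowlist class and the example ranges (reserved documentation ranges) are hypothetical and are not taken from our system.

```python
import ipaddress


class ProviderAllowlist:
    """Hypothetical allowlist keyed on CIDR ranges rather than fixed IPs."""

    def __init__(self, cidr_ranges):
        self._networks = [ipaddress.ip_network(c) for c in cidr_ranges]

    def refresh(self, cidr_ranges):
        # Parse the whole list first so that a malformed update raises
        # ValueError and never replaces a working set of ranges.
        networks = [ipaddress.ip_network(c) for c in cidr_ranges]
        self._networks = networks

    def is_allowed(self, address):
        # True if the address falls inside any currently allowed range.
        ip = ipaddress.ip_address(address)
        return any(ip in network for network in self._networks)


# Example ranges are placeholders, not the provider's real ranges.
allowlist = ProviderAllowlist(["203.0.113.0/24"])
known_ip_allowed = allowlist.is_allowed("203.0.113.7")
new_ip_before_refresh = allowlist.is_allowed("198.51.100.9")

# When the provider announces a new range, refresh instead of redeploying.
allowlist.refresh(["203.0.113.0/24", "198.51.100.0/24"])
new_ip_after_refresh = allowlist.is_allowed("198.51.100.9")
```

With this shape, a new provider address is rejected until the ranges are refreshed and accepted afterwards, so an IP change on the provider's side becomes a data update rather than a code change.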

Future Preventive Measures:

  • Add scheduled monitoring for login and site registration
  • Redesign the integration between the Dashboard and the backend system to be more fault tolerant to changes while still providing good security
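
The scheduled monitoring in the first measure could take roughly the following shape. This is a minimal sketch under stated assumptions, not the monitoring we will deploy: the endpoint URLs are placeholders (the real login and registration URLs are not given in this RCA), the helper names http_status and run_checks are hypothetical, and a real check would likely exercise an actual login rather than only look at the HTTP status.

```python
import urllib.error
import urllib.request

# Placeholder endpoints; substitute the real login and registration URLs.
CHECKS = {
    "login": "https://dashboard.example.com/login",
    "registration": "https://dashboard.example.com/register",
}


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 500.
        return err.code
    except OSError:
        # Covers URLError, connection failures and timeouts.
        return None


def run_checks(checks, fetch=http_status):
    """Return the names of the checks that did not answer with HTTP 200."""
    return [name for name, url in checks.items() if fetch(url) != 200]
```

A scheduler such as cron could call run_checks(CHECKS) every few minutes and alert operations whenever the returned list is not empty. The fetch parameter exists so the check logic can be tested without network access.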