RCA for SYNQ API, playback and dashboard outage on October 25th, 2017
This RCA addresses the outage our services experienced on Oct. 25th, 2017, which was caused by an outage at our DNS provider. The services affected on our end were our video APIs, playback, upload and the user dashboard.
On October 25th, 2017 at 13:02 UTC, we received a notification from PagerDuty (our alerting system) that one of our health checks in Runscope was failing. Operations investigated and noticed that the video API and playback health checks were also failing.
Operations and development began troubleshooting for the root cause and noticed that most of the error messages in Runscope were DNS timeout errors. After a few minutes the issues resolved on their own. By 13:08 UTC the same day the failed tests were passing again, and by 13:40 UTC operations had confirmed that all tests were passing.
After further investigation, it was determined the outage correlated directly with an issue on the DNS provider. Once the DNS provider resolved their issues, the tests began passing.
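The distinction that mattered during troubleshooting, DNS failures versus failures of the services themselves, can be illustrated with a small probe. This is a minimal sketch and not the check we run in Runscope: the function name `probe`, its return labels and its injectable `resolve` and `connect` parameters are assumptions made for illustration.

```python
import socket


def probe(hostname, port=443, timeout=5.0,
          resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Return 'dns', 'connect' or 'ok' depending on which stage failed.

    Hypothetical helper: name resolution and the TCP connection are tried
    separately so that a DNS outage is not mistaken for a service outage.
    """
    try:
        # Stage 1: name resolution only. socket.gaierror is raised when the
        # lookup fails; how long a lookup may take is decided by the system
        # resolver, not by the timeout argument of this function.
        infos = resolve(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"

    # Use the first resolved address; keep only (ip, port) so the same code
    # works for IPv4 and IPv6 results.
    address = infos[0][4][:2]
    try:
        # Stage 2: TCP connection to the resolved address.
        conn = connect(address, timeout=timeout)
    except OSError:
        return "connect"
    conn.close()
    return "ok"
```

A result of "dns" for several hostnames at once points at the DNS provider rather than at an individual service, which matches what was seen in this incident.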
Root Cause and Remediation:
The root cause of this outage was an outage at the DNS provider. Because our various services depend on DNS, this caused problems with the APIs, playback, uploading and the dashboard. The issue was remediated once the DNS provider restored their services to normal.
Future Preventive Measures:
- Track the DNS provider's status page so that we notice failures on their side sooner
- Explore potential remediation steps in case of prolonged DNS issues
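
One remediation that could be explored for prolonged DNS issues is falling back to the last address that resolved successfully. The following is only a sketch of that idea under stated assumptions, not a measure we have adopted: the class name `FallbackResolver` and its interface are hypothetical, and it would only help in clients we control.

```python
import socket


class FallbackResolver:
    """Resolve hostnames, reusing the last known good address on failure.

    Hypothetical helper for illustration only. A cached address can be
    stale, so callers are told when the answer came from the cache.
    """

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver
        self._last_good = {}  # hostname -> last successfully resolved address

    def resolve(self, hostname):
        """Return (address, from_cache)."""
        try:
            address = self._resolver(hostname)
        except OSError:
            # socket.gaierror is a subclass of OSError, so DNS failures land
            # here. Without a cached address the error is re-raised.
            if hostname in self._last_good:
                return self._last_good[hostname], True
            raise
        self._last_good[hostname] = address
        return address, False
```

The trade-off is that a cached address keeps being used even if the real record has changed during the outage, so any such fallback would need a limit on how long cached answers are trusted.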