Postmortem: Service Disruption from Expired SSL Certificate

Sean Li
Published in Fortmatic
4 min read · Oct 4, 2019

I’m writing to address Fortmatic’s first-ever major service disruption. All systems are now operational. We offer our sincerest apologies to the developers, end users, and partners who were affected, and our deep appreciation to everyone who helped us investigate and stayed patient with us. I would also like to thank the Fortmatic engineering team for handling the situation with professionalism.

Summary

Our production fortmatic.com certificate expired on 09/30/2019 at 12:00PM UTC (05:00AM PDT; all times below are in PDT). Partners, developers, and end users were affected from 05:45AM to 08:33AM. For 2 hours and 48 minutes, our landing page and API services returned SSL certificate errors. This amounted to a full service interruption, because browsers and command line tools block connections that fail certificate validation and surface the error to the end user. The outage affected our entire user base.
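As a quick sanity check of the timeline above, the time zone conversion and the outage duration can be reproduced in a few lines of Python. This is only an illustrative sketch: the fixed UTC-7 offset used for PDT and the variable names are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT modelled as a fixed UTC-7 offset, which is enough for one date.
PDT = timezone(timedelta(hours=-7), "PDT")

# Certificate expiry as stated in the summary (UTC).
expiry_utc = datetime(2019, 9, 30, 12, 0, tzinfo=timezone.utc)
expiry_pdt = expiry_utc.astimezone(PDT)

# Start and end of the user-facing impact, in PDT.
impact_start = datetime(2019, 9, 30, 5, 45, tzinfo=PDT)
impact_end = datetime(2019, 9, 30, 8, 33, tzinfo=PDT)
outage = impact_end - impact_start

print(expiry_pdt.strftime("%m/%d/%Y %H:%M %Z"))  # 09/30/2019 05:00 PDT
print(outage)  # 2:48:00
```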

Root Cause(s)

The root cause of this incident is that we didn’t renew the SSL certificate in time. We rely on AWS to notify us when an issued certificate is about to expire so that we can renew it manually. Digging further, we never received that notification: an incorrect setting on our email group caused the emails from AWS to be blocked. Because we never received the notification and didn’t proactively check the expiry dates of our certificates, we missed the renewal window and the certificate expired.
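A proactive expiry check does not have to depend on notification emails arriving. The following is a minimal sketch of such a check, not a description of what we run in production: it reads the certificate a host currently serves and computes how many days remain before it expires. The function names `days_until_expiry` and `fetch_not_after` are hypothetical and exist only for this example.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate's notAfter.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Sep 30 12:00:00 2019 GMT". A negative result means
    the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate that `host` serves.

    The default context validates the certificate, so this raises
    ssl.SSLCertVerificationError if the certificate has already expired.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after("fortmatic.com"), datetime.now(timezone.utc))` and raise an alert when the result falls below a chosen threshold, for example 30 days (an assumed value, not one taken from our setup).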

Impact and Analysis

This incident disrupted service for our partners, developers, and end users, with a total outage duration of 2 hours and 48 minutes. The incident took longer to resolve than expected: because knowledge of SSL certificate operations had not been shared across the team, we were not able to address it as quickly as we should have.

The graph below shows a drop in incoming API requests between 05:45AM and 08:33AM. We still received some API requests because of the initial API loading calls in our product, but the product never loaded successfully in browsers and was never presented to end users. During the outage, incoming traffic dropped by 80%, and numerous dApps, ranging from our partners to individual developers, were affected.

One characteristic of SSL certificates is that the server does not enforce its own certificate’s expiry. Validity is checked by the client, so the SSL handshake can still succeed and the traffic is still encrypted for any client that skips or overrides that check, even though the certificate has already expired. As a result, our API traffic logs showed HTTP 200 status codes for all requests during the outage, and no alert fired.
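Because the failure is only visible from the client side, a monitor has to connect the way a browser does and treat a rejected certificate as an outage. The sketch below illustrates that idea under stated assumptions: the function names `classify` and `probe` and the status strings are made up for this example, and it relies on Python’s default SSL context, which verifies the certificate chain, validity period, and hostname.

```python
import socket
import ssl
from typing import Optional


def classify(exc: Optional[BaseException]) -> str:
    """Map the outcome of a TLS connection attempt to a monitoring status."""
    if exc is None:
        return "ok"
    # Raised when validation fails, e.g. because the certificate has expired.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "certificate_rejected"
    if isinstance(exc, ssl.SSLError):
        return "tls_error"
    # DNS failures, refused connections, timeouts, and similar.
    return "unreachable"


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to `host` with full certificate validation and report a status."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return classify(None)
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return classify(exc)
```

Run from outside our own infrastructure, a probe like this would have reported `certificate_rejected` while the server-side logs still showed successful responses.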

Once the incident was declared all-clear, our customer success team followed up with our primary partners. Our engineering team then started a follow-up epic to bootstrap the processes and alerts needed for improvement, and began drafting a postmortem.

Lessons Learned

There are a few critical lessons learned from this incident:

  • Always check that your email groups are reachable from both internal and external domains, provided external reachability is intended. In our case, all of our sanity check emails to the group came from internal domains, which left the external path untested
  • Document unfamiliar flows, create runbooks, and overcommunicate the importance of these processes so that team members can operate independently
  • Leverage external third-party services to monitor internal systems so that we have full coverage of system health

Action Items

The following action items were addressed immediately after the outage:

  • Investigate why the SSL certificate was not automatically renewed
  • Audit the expiry times of all certificates used by all services
  • Improve our external monitoring on all services to prevent similar issues
  • Ensure and verify that alerts are configured for SSL certificates that are about to expire
  • Use separate SSL certificates for each service to improve resilience
  • Document SSL certificate operation processes
  • Check the settings of all email groups
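To make the certificate audit item more concrete, here is a small sketch of how an inventory of certificates could be ranked by urgency. It is an illustration under assumptions rather than our actual tooling: the `audit` function, the status labels, the 30-day warning window, and the hostnames in the usage note are all hypothetical.

```python
import ssl
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def audit(
    inventory: Dict[str, str], now: datetime, warn_days: int = 30
) -> List[Tuple[str, int, str]]:
    """Rank certificates by days left, soonest expiry first.

    `inventory` maps a service name to its certificate's notAfter string,
    in the format used by SSLSocket.getpeercert(), e.g.
    "Sep 30 12:00:00 2019 GMT".
    """
    rows = []
    for service, not_after in inventory.items():
        expires = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(not_after), tz=timezone.utc
        )
        remaining = expires - now
        if remaining.total_seconds() <= 0:
            status = "EXPIRED"
        elif remaining.days < warn_days:
            status = "RENEW SOON"
        else:
            status = "ok"
        rows.append((service, remaining.days, status))
    # Most urgent certificates come first.
    return sorted(rows, key=lambda row: row[1])
```

Calling `audit({"api.example.com": "Oct 15 12:00:00 2019 GMT"}, datetime.now(timezone.utc))` with a made-up hostname returns one row per service, so anything marked `EXPIRED` or `RENEW SOON` appears at the top of the list.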

More Help

Fortmatic is rapidly gaining traction and adoption among developers, with many new projects actively in development. We are looking to hire a Software Engineer (DevOps/Infrastructure) to join us on our mission. You can also check out our other open roles at careers.fortmatic.com.

If you are interested in learning more about Fortmatic and integrating with us, make sure to join us in our Discord channel or tag us on Twitter!

Sean Li
Fortmatic

ceo @magic_labs @fortmatic | ex-@docker @kitematic | @uwaterloo alumni