Building mOps (Modern Ops Process): Transparent Status and RCAs

Some Context

The way companies handle operations has changed quite a bit in recent years. One major change I have witnessed at the startups and enterprise companies I’ve worked for is the growing need for transparency. Transparency gives customers a real glimpse into a company’s internal operations process and workflow. It builds trust with clients because issues and incidents are not hidden, and it shows that you take problems seriously and are always working to improve.

Transparent Status and RCAs

With that in mind, here at SYNQ we use several modern tools to bring transparency to our clients. We use a public-facing StatusPage that shows the health of our services and keeps a log of any incidents or service interruptions. The StatusPage is connected directly to our Runscope testing and monitoring platform, so customers are updated in real time. Customers can also subscribe to the StatusPage feed or follow us on Twitter (which StatusPage updates automatically). Finally, these notifications are sent to our internal Slack channel, where our ops and dev teams can react quickly.
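For readers who want to wire up the Slack part by hand, here is a minimal Python sketch of sending a status update to a Slack incoming webhook. It is only an illustration, not a description of how the StatusPage or Runscope integrations work internally. The function names and the message wording are hypothetical, and the webhook URL is something you would generate in your own Slack workspace.

```python
import json
import urllib.request


def build_slack_message(service: str, status: str, detail: str) -> dict:
    """Build the JSON body for a Slack incoming webhook (a plain "text" message).

    The "[STATUS] service: detail" layout is an assumption made for this sketch.
    """
    return {"text": f"[{status.upper()}] {service}: {detail}"}


def post_to_slack(webhook_url: str, message: dict) -> int:
    """POST the message to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the message builder separate from the network call makes it easy to check the wording of a notification without actually posting anything to a channel.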

On top of this, we’re looking to implement a new process in which we post our RCAs on Medium. For context, “RCA” stands for Root Cause Analysis: a document, traditionally emailed to specific clients, that explains the “why, what, and how” of an outage, provides detailed information about what happened and, more importantly, lays out the steps that will be taken to prevent the outage from recurring. After publishing the RCA blog post, we can share it via our StatusPage.

Using the tools and methods described above, we can show customers that we are transparent in how we operate and that we are always working to improve our processes.

Below is an RCA blog post template. Feel free to use it in your own RCA process as well.

RCA Blog Post Template

Introduction:

Give a brief description of the outage and what particular service or services were affected.

Event Description:

Here you can say exactly what caused the outage. State when it was first noticed and reported, making sure to include a date and time. You can also specify how long the outage lasted and the date and time when it was fully resolved.

Root Cause and Remediation:

Here one would describe why the problem happened and give a brief, high-level description of how it was solved.

Future Preventive Measures:

Here one would give a bulleted list of the things that were changed, and the things that will be done, to prevent this particular outage from happening again.
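If you would rather fill in the template from a script than by hand, the four sections above can be sketched as a small Python data structure. This is only one possible layout: the class name `RcaPost`, its field names, and the timestamp format are all hypothetical choices made for this example. The outage duration is derived from the two timestamps so it never has to be typed in separately.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class RcaPost:
    """Hypothetical container for the four sections of the RCA template."""
    introduction: str
    event_description: str
    first_reported: datetime
    fully_resolved: datetime
    root_cause_and_remediation: str
    preventive_measures: List[str] = field(default_factory=list)

    def duration_minutes(self) -> int:
        """Whole minutes between the first report and full resolution."""
        delta = self.fully_resolved - self.first_reported
        return int(delta.total_seconds() // 60)

    def render(self) -> str:
        """Return the post as plain text, one template section per heading."""
        time_format = "%Y-%m-%d %H:%M"
        timing = (
            f"First reported: {self.first_reported.strftime(time_format)}. "
            f"Fully resolved: {self.fully_resolved.strftime(time_format)}. "
            f"Duration: {self.duration_minutes()} minutes."
        )
        bullets = "\n".join(f"- {item}" for item in self.preventive_measures)
        sections = [
            ("Introduction:", self.introduction),
            ("Event Description:", f"{self.event_description}\n\n{timing}"),
            ("Root Cause and Remediation:", self.root_cause_and_remediation),
            ("Future Preventive Measures:", bullets),
        ]
        return "\n\n".join(f"{heading}\n\n{body}" for heading, body in sections)
```

Calling `render()` on a filled-in `RcaPost` returns text that can be pasted into a blog editor, with the preventive measures already formatted as a bulleted list.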