1 year of SRE

Jennifer Strejevitch
Product and Engineering at Condé Nast
Aug 25, 2020

How to be effective as a small Site Reliability Engineering practice

When you are a new and small team in a big organisation, you may not be able to embed yourself in other teams or concentrate on a single application. Instead, you can grow the practice in a centralized way, focus on benefits across the board and really prioritize the customer. We found that if you concentrate on one goal as much as possible, you can make a real impact.

The first thing we did when we started the practice was to survey the current applications’ release processes and observability, and to have conversations with stakeholders, developers and tech leads to understand their main pain points. What we heard was that teams really cared about:

  • Alerting noise/fatigue
  • Clarity around alerts (ownership, understanding and creating monitors)
  • Lack of effective monitoring
  • Visibility on what’s changed and why

We determined from the survey that there was little visibility around when features were released and a general lack of confidence in monitoring and alerting.

Throughout the year we focused on solving these problems, and among our wins were the following tools and processes:

Production readiness

We took over the Production Readiness process and established a checklist that is mandatory for all new projects and architectures going to production.

It’s a questionnaire about your application’s reliability, covering areas such as its deployment process, its failover modes and the processes for recovering the application.

After that, an SRE member sits with the team (developers and Product) and discusses any points the team is unsure about, and we mutually decide how comfortable we would be if some of the items were not fulfilled (e.g. if it’s a backend service that doesn’t need to be up 24/7, you might not need monitoring or an on-call rota). Recommendations and risks are raised to Product so that service level expectations can be correlated with software quality.
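To make the idea concrete, here is a minimal, hypothetical sketch of what a few checklist items might look like if captured as data. The questions and the `ReadinessItem` shape are illustrative assumptions, not our actual checklist:

```typescript
// Hypothetical production-readiness checklist items, sketched as data so they
// can be reviewed and tracked during the session. Illustrative only.
interface ReadinessItem {
  question: string;
  required: boolean;   // can the team ship without this?
  answer?: string;     // filled in during the review with the team
}

const checklist: ReadinessItem[] = [
  { question: "How is the application deployed and rolled back?", required: true },
  { question: "What are the failover modes if a dependency is unavailable?", required: true },
  { question: "Which dashboards and alerts cover this service?", required: true },
  { question: "Is an on-call rota needed, or is the service not 24/7-critical?", required: false },
];

// Items that are required but still unanswered become the risks raised to Product.
const openRisks = checklist.filter((item) => item.required && !item.answer);
console.log(`Open production-readiness risks: ${openRisks.length}`);
```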

Removing pager fatigue

The metrics we had available from our CDN provider didn’t give us the granular view we required given our multi-origin setup, so they had not previously been used as the main source of truth for our monitors. However, the alternative in use wasn’t great either. By reviewing what we needed from our metrics and replacing them with new KPIs (Key Performance Indicators), we reduced on-call fatigue with SLO-based monitoring.
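One common way to turn an SLO into a monitor is error-budget burn rate: page when the budget is burning fast enough to threaten the SLO, rather than on every error spike. A minimal sketch of the arithmetic, with an assumed SLO target and thresholds rather than our production values:

```typescript
// Minimal sketch of SLO-based alerting arithmetic (illustrative numbers only).
const sloTarget = 0.995;             // assumed 99.5% availability SLO
const errorBudget = 1 - sloTarget;   // 0.5% of requests may fail

function burnRate(totalRequests: number, failedRequests: number): number {
  const errorRate = failedRequests / totalRequests;
  return errorRate / errorBudget;    // 1.0 = burning budget exactly on pace
}

// Example: 1% errors over the last hour burns the budget at 2x the sustainable rate.
const hourlyBurn = burnRate(1_000_000, 10_000);
if (hourlyBurn > 2) {
  console.log(`Page on-call: error budget burning at ${hourlyBurn.toFixed(1)}x`);
}
```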

Deployment visibility

From our survey and previous Post Mortems we learned that people usually had a hard time answering the question “Has application X been deployed today?”. We created a script in our CircleCI pipeline that talked to the Datadog API and created a deployment event, which could then be added to the teams’ dashboards to answer that question at a glance. All teams found it very easy to adopt and it was a major improvement to visibility. Again, we work with easy adoptability in mind (yes, adoptability, not just adaptability: something that’s easy for any team to adopt) across the board.
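A minimal sketch of that kind of step (not our exact script) looks something like this. It assumes Node 18+ for the global fetch, a DD_API_KEY environment variable, and uses CircleCI’s built-in environment variables:

```typescript
// Minimal sketch of a deployment-event script, run as a CircleCI step after a
// successful deploy. Not the exact script; DD_API_KEY is an assumed env var.
const app = process.env.CIRCLE_PROJECT_REPONAME ?? "unknown-app";
const sha = process.env.CIRCLE_SHA1 ?? "unknown-sha";

async function postDeploymentEvent(): Promise<void> {
  const res = await fetch("https://api.datadoghq.com/api/v1/events", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "DD-API-KEY": process.env.DD_API_KEY ?? "",
    },
    body: JSON.stringify({
      title: `Deployment: ${app}`,
      text: `Deployed commit ${sha} via CircleCI (${process.env.CIRCLE_BUILD_URL ?? ""})`,
      tags: [`app:${app}`, "source:circleci", "event_type:deployment"],
    }),
  });
  if (!res.ok) throw new Error(`Datadog event failed: ${res.status}`);
}

postDeploymentEvent().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Overlaying these events on a team’s dashboard is what makes “what changed and when” visible without leaving Datadog.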

Monitoring, reliability and on-call Guidelines

  • On-call University: We created the “On-call University”, a course that everyone who is going to join an on-call rota must take. They learn about all rotas, escalation, PagerDuty and Game Days, where we simulate a real issue and go through the whole process and resolution as if it were a real incident. The aim is for engineers to feel confident being on-call.
  • Golden signals: Many of our applications follow similar patterns (e.g. Node.js) and are deployed on k8s, so there are common signals that are useful to everyone. We also keep our monitors in Terraform, so we created a module that can be extended by anyone who deploys applications to our infrastructure. Teams get APM and k8s monitors for free and only have to concentrate, if needed, on application-specific custom metric signals (see the sketch after this list).
  • SRE clinic: Sometimes teams want ad-hoc advice or have new requirements, e.g. understanding the quality of service of projects that have several cloud dependencies, or help with defining their SLOs.
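The real golden-signal definitions live in a Terraform module; as a language-neutral sketch of what the shared monitors cover and how a team extends them, here is roughly the same idea expressed in TypeScript. The queries and thresholds are illustrative assumptions, not our production values:

```typescript
// Sketch of the shared "golden signal" monitors every service gets by default.
// The real definitions are a Terraform module; values here are illustrative.
interface GoldenSignalMonitor {
  signal: "latency" | "traffic" | "errors" | "saturation";
  query: string;       // Datadog-style metric query, parameterised by service
  threshold: number;   // alert threshold (units depend on the signal)
}

function defaultMonitors(service: string): GoldenSignalMonitor[] {
  return [
    { signal: "latency",    query: `avg:trace.express.request.duration{service:${service}}`, threshold: 0.5 },
    { signal: "errors",     query: `sum:trace.express.request.errors{service:${service}}.as_rate()`, threshold: 1 },
    { signal: "traffic",    query: `sum:trace.express.request.hits{service:${service}}.as_rate()`, threshold: 1 }, // alert if traffic drops below this
    { signal: "saturation", query: `avg:kubernetes.cpu.usage.total{kube_deployment:${service}}`, threshold: 0.8 },
  ];
}

// A team extends the defaults with application-specific custom metrics only.
const monitors = [
  ...defaultMonitors("my-nodejs-app"),
  { signal: "errors" as const, query: "sum:custom.checkout.failures{*}.as_rate()", threshold: 5 },
];
console.log(`${monitors.length} monitors defined for my-nodejs-app`);
```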

Platform load testing in production

SRE introduced weekly load testing to proactively ensure we are resilient to spikes or growth in traffic (e.g. as new websites/brands are added to our platform) and to have confidence in capacity planning.
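The core idea, stripped down to a tiny TypeScript sketch (our real runs use dedicated load-testing tooling and agreed traffic profiles, not ad-hoc scripts; TARGET_URL and the ramp profile are placeholders):

```typescript
// Tiny sketch of a load-test stage: send a batch of requests per second and
// record latency percentiles. Illustrative only; assumes Node 18+ (global fetch).
const TARGET_URL = process.env.TARGET_URL ?? "https://example.com/health";

async function runStage(requestsPerSecond: number, seconds: number): Promise<number[]> {
  const latencies: number[] = [];
  for (let s = 0; s < seconds; s++) {
    const batch = Array.from({ length: requestsPerSecond }, async () => {
      const start = Date.now();
      await fetch(TARGET_URL).catch(() => undefined); // failures still count towards timing
      latencies.push(Date.now() - start);
    });
    await Promise.all(batch);
    await new Promise((r) => setTimeout(r, 1000)); // crude pacing between batches
  }
  return latencies;
}

function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))];
}

(async () => {
  // Assumed ramp profile: 5 rps, then 20 rps, each for 30 seconds.
  for (const rps of [5, 20]) {
    const latencies = await runStage(rps, 30);
    console.log(`${rps} rps -> p95 ${percentile(latencies, 95)}ms`);
  }
})();
```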

Chaos engineering

SRE also introduced configurable and automated Kubernetes resource failure injection, so that we learn about application and platform weaknesses ahead of time.
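A deliberately simple sketch of pod-level failure injection follows; our actual setup is configurable and automated, and this only shows the core idea. It assumes kubectl access to the target cluster, and CHAOS_NAMESPACE is a placeholder:

```typescript
// Delete one random pod in a namespace and let the platform prove it can
// recover. Core idea only; run against a test cluster before production.
import { execSync } from "node:child_process";

function injectPodFailure(namespace: string): void {
  const out = execSync(`kubectl get pods -n ${namespace} -o name`, { encoding: "utf8" });
  const pods = out.trim().split("\n").filter(Boolean);
  if (pods.length === 0) {
    console.log(`No pods found in ${namespace}`);
    return;
  }
  const victim = pods[Math.floor(Math.random() * pods.length)];
  console.log(`Injecting failure: deleting ${victim} in ${namespace}`);
  execSync(`kubectl delete ${victim} -n ${namespace}`, { stdio: "inherit" });
  // Follow-up: check dashboards and alerts to confirm the deployment
  // self-heals and that the right monitors fired.
}

injectPodFailure(process.env.CHAOS_NAMESPACE ?? "default");
```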

SRE Newsletter

SRE introduced a weekly newsletter aimed at all levels, from developers to Product and C-level, showing SLOs, incident trends, risks, load testing, deployment rates and what’s new. Teams feel encouraged to emit these metrics (such as deployment events) so that their data appears accurately in the newsletter.

We know we still have a long way to go and need to spread these practices across the globe. Not everything can be done in a generic and centralized way, but I am confident we used the resources we had to bring the most value within the areas where we had the power to make an impact. So my advice for a small, new team is to initially concentrate all your forces on one goal; that way the team can bond more easily, learn each other’s strengths and do great things :).

Special thanks to Clayton Howe, Hassy Veldstra, Khanh Nguyen, Laura Hamling-Fry & Lee Davies :)
