Site Reliability at DAZN

Craig McLean, DAZN Engineering
Sep 13, 2021

Hi! I’m Craig McLean, Head of SRE at DAZN. We want to talk a little about how we keep things reliable, what we do when things fail, and how we learn from all of this. We hope that by doing this we’ll get to hear your thoughts, and maybe something about how you do reliability!

Who are the SRE team?

DAZN SRE is part of the core Platform & Tooling shared-services group. We are a team of 7 software engineers (soon to be 9!) working in the UK and The Netherlands, who love to solve reliability problems.

We exist to specialise in reliability practices, like those written by Google, and to build tooling to support and share these with our 500-strong developer community. We create software, integrations and solutions every day to help the teams gain a clear view of what’s going on, avoid problems and react well to incidents when they inevitably happen. We live inside the blameless, build-run-own world of DAZN Product Engineering, and run Scrum with most of the trimmings. We write mostly Golang, with a little Node scattered around, and occasional Python :)

How do you do SRE?

As a team of software engineers, we live in a very similar world to everyone else in Product Engineering across our 200+ AWS accounts. Our services can start throwing errors, become overloaded, turn out to be buggy, or just plain fail. They can also go unnoticed by other teams and sit idle when they could be adding value.

Because of this, we often look first at ourselves and solve the problems we see within our team. We also speak regularly to other teams about their challenges and feed those into our backlog. Finally, we scale our solutions across the business.

So in our team, we have 5 pillars of focus:

Focus and Communication

We try to ensure we’re doing the most important things, which solve the biggest problems faced by DAZN. We make sure the solutions we come up with are discussed early and often with the stakeholders, and we clearly communicate changes and releases to the wider community so they know what’s coming, and what’s here.

This pillar includes our weekly ‘how to’ presentations, regular roadmap reviews, our presence in all Post Mortems for serious issues, and our new Community Service, which targets blog posts, release notices and other important information to the right audience.

Observability

They say you can’t manage what you don’t measure. New Relic is our weapon of choice, but as a big AWS customer we also have CloudWatch; plus we have built things ourselves where New Relic can’t currently provide them, like our SLO service (called Djed), which is Grafana sitting on top of VictoriaMetrics. To enable the dev community, we provide integrations and example code that capture standard and bespoke metrics, so individual teams can quickly visualise and alert on them. We also provide example Terraform for creating all the New Relic goodness, like dashboards and alerts.
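
The post doesn’t spell out how a service feeds SLI data into Djed, so take this as a hedged illustration only: one common pattern is for a service to expose Prometheus-format metrics that a VictoriaMetrics scraper can collect. The metric name, labels, route and port below are invented for the example.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is an illustrative SLI counter: total requests by outcome,
// from which an availability SLO could later be derived.
var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "playback_requests_total", // hypothetical metric name
		Help: "Playback API requests by outcome.",
	},
	[]string{"outcome"},
)

func main() {
	prometheus.MustRegister(requestsTotal)

	http.HandleFunc("/play", func(w http.ResponseWriter, r *http.Request) {
		// ...real handler work would go here...
		requestsTotal.WithLabelValues("success").Inc()
		w.WriteHeader(http.StatusOK)
	})

	// /metrics exposes Prometheus-format data for a scraper to collect.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
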
Once an alert has fired, you might want to reach for the logs. To enable this, we run a large-scale centralised logging solution based around Kinesis, which ensures that when things go wrong, all our logs are available quickly.
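
The internals of that pipeline aren’t described here, but as a rough sketch of the idea, this is how a service might put a structured log record onto a Kinesis stream using the AWS SDK for Go. The stream name, region and fields are placeholders, not our real configuration.

```go
package main

import (
	"encoding/json"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

// logRecord is an illustrative structured log entry.
type logRecord struct {
	Service   string    `json:"service"`
	Level     string    `json:"level"`
	Message   string    `json:"message"`
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	client := kinesis.New(sess)

	payload, _ := json.Marshal(logRecord{
		Service:   "playback-api", // hypothetical service name
		Level:     "error",
		Message:   "upstream timeout",
		Timestamp: time.Now().UTC(),
	})

	// Put the record onto the central logging stream (name is illustrative).
	_, err := client.PutRecord(&kinesis.PutRecordInput{
		StreamName:   aws.String("central-logging"),
		PartitionKey: aws.String("playback-api"),
		Data:         payload,
	})
	if err != nil {
		log.Fatalf("failed to ship log record: %v", err)
	}
}
```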

Alerting

We don’t have many processes, but the incident process is one we do have. Nothing is worse than feeling lost at 3 a.m. when your service implodes, so SRE built a simple process, including the use of PagerDuty to notify teams when problems occur, and a senior coordinator team on a rota to assist if folks get stuck or big problems occur. Again, we provide developers with the ability to set this up and notify via PagerDuty from any of our data sources, including CloudWatch, New Relic and Djed. Of course, alerting is just a webhook, so they can build their own if they want.
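
To make the “just a webhook” point concrete, here’s a minimal sketch of a custom tool triggering an incident through the PagerDuty Events API v2. The routing key, summary and source are placeholders; this isn’t our actual integration code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// pdEvent mirrors the PagerDuty Events API v2 trigger payload.
type pdEvent struct {
	RoutingKey  string    `json:"routing_key"`
	EventAction string    `json:"event_action"`
	Payload     pdPayload `json:"payload"`
}

type pdPayload struct {
	Summary  string `json:"summary"`
	Source   string `json:"source"`
	Severity string `json:"severity"`
}

func main() {
	event := pdEvent{
		RoutingKey:  "YOUR-INTEGRATION-KEY", // placeholder per-service integration key
		EventAction: "trigger",
		Payload: pdPayload{
			Summary:  "Error rate above SLO threshold", // illustrative alert text
			Source:   "djed",
			Severity: "critical",
		},
	}

	body, _ := json.Marshal(event)
	resp, err := http.Post("https://events.pagerduty.com/v2/enqueue",
		"application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("failed to send event:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("PagerDuty responded with", resp.Status)
}
```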

Capacity and Performance

While the Cloud team have a great autoscaler, it doesn’t scale in advance of upcoming events. To help with this, we partnered with the Machine Learning team to build Smart-Scaling. This uses machine learning to estimate upcoming concurrency and passes the prediction back into New Relic. From there it applies a service-specific “magic scaling value” and uses the result to pre-scale services to where they need to be just before the event starts. Of course, to know the magic scaling value you need to know your performance, so SRE also provide k6 Cloud, integrated into our CI tooling, so teams can check from release to release what their service performance looks like.
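
Smart-Scaling’s internals aren’t covered here, but to illustrate the arithmetic only, here’s a sketch that turns a predicted concurrency figure and a service-specific scaling value into an Auto Scaling desired capacity. Every name and number is invented for the example.

```go
package main

import (
	"log"
	"math"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// desiredCapacity turns a predicted concurrency into an instance count using a
// service-specific scaling value (concurrent users one instance can handle)
// plus a small safety headroom.
func desiredCapacity(predictedConcurrency, usersPerInstance, headroom float64) int64 {
	return int64(math.Ceil(predictedConcurrency * (1 + headroom) / usersPerInstance))
}

func main() {
	// Illustrative numbers: these would come from the ML forecast and load tests.
	target := desiredCapacity(250_000, 4_000, 0.2)

	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	client := autoscaling.New(sess)

	// Pre-scale the group before the event starts (group name is a placeholder).
	_, err := client.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("playback-api-asg"),
		DesiredCapacity:      aws.Int64(target),
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		log.Fatalf("pre-scaling failed: %v", err)
	}
	log.Printf("pre-scaled to %d instances ahead of the event", target)
}
```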

Data

With all these tools and others, we have oceans of data. This pillar is going to be a big focus for us in 2022, but for now, we have standardised per-repo manifests, which provide metadata about services via an API that is used all over the place, including cost dashboards and PagerDuty. We have also created a release page to replace our human Change Advisory Board by visualising for teams everything going on around them, such as incidents, events, customer numbers, etc., so they can be confident they are releasing at a good time. Data also helps us learn where things go wrong most often, so we can focus our efforts where they are most valuable.
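
The manifest schema isn’t published here, so this is purely an invented illustration of the kind of per-repo metadata such a manifest might carry, parsed into a Go struct that an internal API could then serve.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// serviceManifest is an invented sketch of per-repo service metadata; the real
// schema is not described in this post.
type serviceManifest struct {
	Name         string   `yaml:"name"`
	Team         string   `yaml:"team"`
	Tier         int      `yaml:"tier"`
	PagerDutyKey string   `yaml:"pagerdutyKey"`
	CostCentre   string   `yaml:"costCentre"`
	Dependencies []string `yaml:"dependencies"`
}

func main() {
	// Example manifest contents; every value here is a placeholder.
	raw := []byte(`
name: playback-api
team: playback
tier: 1
pagerdutyKey: YOUR-INTEGRATION-KEY
costCentre: streaming
dependencies:
  - entitlements
  - drm-licensing
`)

	var m serviceManifest
	if err := yaml.Unmarshal(raw, &m); err != nil {
		panic(err)
	}
	fmt.Printf("%s is owned by %s (tier %d)\n", m.Name, m.Team, m.Tier)
}
```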

Where do you struggle?

Adoption

The comms pillar has plans to help with this, but it’s hard to keep 500 fast-moving developers fully aware of what we do, especially as they are being notified of lots of other things as well. The Community Service is designed to make sure our comms are always valuable, and never filtered out of people’s inboxes.

Staffing

It’s been super-hard finding talent to join the team and help us build and improve all these things. We’re a great place to work, and we pay pretty well, so if anything in here interests you, apply to join us!

What’s next on your roadmap?

Growing the team

We have a bunch of ideas in the backlog to help make SRE more valuable at DAZN, and we need more people to help us both deliver these and define what should come next.

Front End focus

We’re a back-end team, so we have traditionally had a back-end focus. Starting in Q4 of 2021, we’re increasing our focus on the front-end, to help those teams get the same quality of SRE services as the back-end teams have. We’re going to start by standardising error ingestion across the front-end and extending the reach of New Relic into end-user devices to build better observability.

Data, data, data

We know an awful lot, but we don’t always know that we know it. We want to push more decision-making into the world of data, so we can tell when things are happening (or are about to happen) and act accordingly, and build a more global understanding of the DAZN customer experience. For example, in the incident space we can learn what impacts are seen across other services to spot unknown dependencies. We can start to see the paths our users are taking across our whole estate, and measure every journey more accurately.

Your thoughts?

We’re super keen to get feedback from anywhere and everywhere, so if you have any thoughts or comments on what we talked about here please don’t be shy! Let us know in the comments!
