My time at DAZN

A Journey of Discovery

A retrospective look into what DAZN is and some of the engineering challenges past and present

Tim Lewis
DAZN Engineering

--

Plotting a route of adventure and learning

As my time at DAZN draws to a close, I thought it would be good to do a retrospective of my time here and take a look at some of the things I have learned and imagine what may be to come for the company going forward, at least from a technical perspective.

I think it’s fair to say that my journey over the last 3 years has been one of huge enjoyment, achievement and learning. Helping build and shape a development centre in Leeds has been supremely rewarding. It has very much been a voyage of discovery and I’ve learned a terrific amount in that time. I’d love to share a few of these learnings and thoughts, looked at through the lens of my own personal experience.

Engineers Are People Too

One of the things I really love about DAZN is the culture we have built. Whilst I would be lying if I said it has always been plain sailing and that we’ve got everything right, every time, it has been a place where individuals are valued. I don’t mean that solely in the sense that we want high performers churning out top-level code (though obviously it’s great that we have that as well!) but also from a psychological safety point of view; we care about individuals, not about employees on a spreadsheet.

People Power

Obviously, the last 16 months (and counting sadly) have been incredibly tough on everyone. No one expected a global pandemic would force us all to be prisoners in our own homes for a good chunk of the time it has been rampaging. It has been a unique set of circumstances that has come at a cost that I suspect we will be counting for many years to come. A cost that is not just financial or economic, but to people’s mental health and the way that we do work going forward.

It is very easy to assume that all a software engineer needs is a laptop and an internet connection to work from home. What more do you need? You have ways of communicating with team members. You have the IDEs to write the code. You have the systems to deploy everything into the cloud. Simple. No problems. I admit that, when lockdown first happened, this was largely my thinking on how we'd deal with it, and it held up for a few months because, surely, we'd be back in the office by the summer.

In the words of a prophet of sorts (OK, it's Dave Mustaine of Megadeth, in the song "Sweating Bullets"):

Hindsight is always 20:20, but looking back it’s all a bit fuzzy.

It all seems a blur now, but we were wrong when we thought it would be done in a few months. We were also very wrong to assume that just because we had the technical tools in place to work in lockdown, we would have all the tools to actually cope with it. We are, on the whole, social creatures. We thrive off interactions with other people. We form "tribes": social groups that work together to achieve a common goal. Underpinning that is a whole bunch of almost subliminal communication that happens organically and in person, by being physically there. No video conference will ever convey the body language we have developed as we grow up. Future generations may develop new ways of communicating effectively online, but we older folks need those visual cues.

Also woefully overlooked was how good or bad a working environment people found themselves in. I was supremely fortunate to have moved house 4 months before lockdown happened. We'd moved as we needed more space. I had my own "man-cave" retreat. Not everyone was so fortunate. People in house shares or in small bedsits were having to work on top of each other. People had kids they had to look after and try to school as best they could. Relationships suffered. Mental health frayed.

How did we get through that all? It was a monumental task when you really think about it. Underpinning it all, I feel, was the fact that we were fully enabled to give every individual the time and space to deal with everything in their own way. If I take one thing away from all of this, it’s the fact that everyone has their own way of dealing with things, their own coping mechanisms. It is futile to try and enforce solutions and programmes of behaviour across a group of individuals; treat everyone as the person they are. Give them the space to breathe and deal with what they have to deal with. We’re not just names on a spreadsheet; we’re all people too.

9x4x1

The set of numbers above makes me think of a thing! I could, at this point, disappear down a rabbit hole of sci-fi reminiscing, but to keep this blog on track we'll just stick to the thing I'm thinking of! The Monolith.

The primitive Monolith

I’m sure there have been countless posts about how microservices are better than monoliths (and a few that argue the other way no doubt) so I won’t go into the whys and wherefores, but, on the whole, monoliths are bad, right?

We all (largely) agree that microservices make things easier to operate, maintain and support because they’re discrete services that do one thing well. This is great, but do we actually escape the monolith by simply breaking one up into small pieces (microservices) and distributing it? You can probably guess the answer to that last question.

Taking any large, complex system and breaking it up into discrete units is far easier to say than it is to actually do. There is a very easy trap to fall into, and it seems to happen a lot more often than you would care to think. I feel it stems, in part, from a certain way of thinking about problems, and about how they have always been solved, that leads us into these pitfalls.

When we take a complex, intertwined system and break it up, the temptation to define the interactions between the new, "separate" parts exactly as they were before is very strong. After all, it worked before, right? Why change anything? If this bit of code hooked into that bit of code, then we can do that again just by hitting it up via the new API call we created to access it. Great. But what benefit have we actually achieved here? The two pieces of code are still just as tightly bound together. Yes, we may be able to scale it, but if our brand new microservice depends on that next microservice to do something for us and it doesn't, for whatever reason, then we fail too.

In essence, if we’re not careful we can find we’ve taken a monolithic solution and just distributed it, essentially getting a “worst of both worlds” scenario. Not only do we have all those brittle dependencies everywhere, we’ve also added the complication of all that extra infrastructure and operational overhead.

So how do we avoid this state of affairs? I've always found it easiest to assume that every single other service, be it large, small or micro, external or internal, is a black-box. A black-box whose inner workings we neither know nor care about (at least we shouldn't care! You don't care about how someone else does their thing, do you? Let them be experts at their thing and we'll be experts at ours!). Treat them all like they're run by an external third party, just with the benefit of knowing that you've got really good lines of communication open with them.

By doing this we are already preparing ourselves for dealing with an "external" service. It leads us to ask questions such as "what happens if they go away?" or "what if they start changing what they send us?" and the like. It also makes it easier for us to think about timing and asynchrony: what happens if we get a slow response? How do we deal with that? What happens if a downstream service is unable to accept the thing we want to send them? What do we do then?
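To make that concrete, here's a minimal TypeScript sketch of calling a neighbouring service as a black-box, with a timeout and an explicit answer to "what do we do if they don't respond?". The service name, URL and response shape are all made up for illustration.

```typescript
// A minimal sketch of calling another team's service as a black-box,
// with an explicit timeout and failure path. Names and shapes are hypothetical.
type Fixture = { id: string; kickOff: string };

async function fetchFixtures(timeoutMs = 2000): Promise<Fixture[] | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const res = await fetch("https://fixtures.internal.example/v1/fixtures", {
      signal: controller.signal,
    });
    if (!res.ok) {
      // They changed, broke or removed something: treat it as "they went away".
      console.warn(`fixtures-service responded with ${res.status}`);
      return null;
    }
    return (await res.json()) as Fixture[];
  } catch (err) {
    // Timeout or network failure: we still have to carry on doing our thing.
    console.warn("fixtures-service unavailable", err);
    return null;
  } finally {
    clearTimeout(timer);
  }
}
```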

These sorts of questions are vital for unhooking our thinking as well as our code! If we make each part of our overall service robust and able to continue to function when other bits fail, then the overall service is massively improved. All obvious stuff, I'm sure you are thinking. So where's the problem? Well, first up, it's easy to say this stuff, but actually doing that extra work can be difficult, especially if you're working to tight deadlines; no one is moving the start of the Premier League season so we can finish off those last couple of lambdas. It's very easy to just think "Ah **** it, I'll just call that internal API for that service directly because I get my stuff done quicker."

Again, this all seems obvious when written down, but this, and very similar things, happen. A lot. You then end up with a fragile network of services where, if one goes down, others fail too. These problems can be solved, though. We often talk about decoupling services, but doing so often requires a bit of a shift in thinking about how services communicate with each other. Message-based architectures help massively, as does infrastructure that supports things like service discovery and service meshes, as well as practices like chaos engineering. I won't go into detail on those here, but they are all things we're doing at DAZN to solve some of the issues I've listed.
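As a flavour of what a decoupled interaction can look like, here's a hedged sketch of publishing an event to a topic (using AWS SNS purely as an example) rather than calling a downstream service's API directly. The topic and event shape are hypothetical.

```typescript
// A sketch of the decoupled alternative: instead of calling a downstream
// service's API directly, publish an event and let whoever cares subscribe.
// The topic and event shape are hypothetical; any message broker works here.
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sns = new SNSClient({});

export async function announceGoalScored(matchId: string, scorer: string) {
  await sns.send(
    new PublishCommand({
      TopicArn: process.env.GOAL_EVENTS_TOPIC_ARN, // owned by us, advertised to others
      Message: JSON.stringify({ matchId, scorer, at: new Date().toISOString() }),
    })
  );
  // Whether zero, one or ten other services consume this is not our problem:
  // our boundary ends at publishing a well-formed event.
}
```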

All that good work can be quickly undone, however, with environment abuse …

(Fr)Agile Waterfall

Back in the mists of time, we used to write monolithic applications utilising basic source control — who remembers using CVS? Requirements were given to us up front for the whole thing and we went away and built it. If we were feeling particularly generous we might even have had a staging environment that we could install the fruits of our labour on for the relevant business owners to see what we had done, before we unleashed it on an unsuspecting user base.

The Victoria Falls run dry

Fast forward some years (I won't reveal how many as it's a depressingly high number) and we are in the here and now. The concept of staging environments has followed us through the years, and we've potentially picked up a few others along the way as well, such as DEV and TEST.

"So what!?" I hear you ask. Well, imagine this idea transported into the world of microservices, where, if we're following our ideal from above, each service is its own black-box and should be treated as such. The concept, then, of having a DEV/TEST/STAGE version of our overall product becomes a major issue, especially if we say that every microservice's STAGE should be linked to its connecting neighbour's STAGE, and so on.

Doing this, to my mind, is analogous to taking a team of elite footballers and making them tie themselves together. No one can perform their role unless they drag the others with them; heaven forfend that anyone should want to do something different, like pick up that marauding full back, if the team has decided they need to run down the channel on the opposite flank.

Being a little more serious, it is an absolute killer for productivity. If other teams are relying on your stage environment to be there and responding to requests, how on earth do you release something new without a hideous overhead of delivery alignment and support?

Don’t be fooled by environments!

Remember: each microservice should be a black-box. As a team, you should never, ever expect to hook something up to someone else's DEV/TEST/STAGE environments. As a team, you should provide endpoints that other teams can develop against if you are unhappy or cautious about an active development piece hitting your production endpoints.

Advertise your APIs. Tell others where they can hook up when they're developing. There is very little reason why this should not be production in the vast majority of cases.

Physical environments and data flows are separate things!

I'm not sure I can stress that enough. Expecting a raft of different teams across many services to support a complete end-to-end flow on environments that are, by their very nature, transient is madness. It kills productivity, requires an inordinate amount of management and is, by its very essence, waterfall thinking. Literally nothing stops you having test, staged and production data in the same "physical" environment. Business flows should absolutely not rely on different physical environments to operate.

Each service can have its own set of environments; that’s fine. If your microservice needs a STAGE then you should have one. If you decide that you will use your STAGE environment to serve data for other teams to develop against then that is fine. It’s your black-box; you decide what goes on in there. Treat any other black-boxes as third party suppliers. Respect your contracts.
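To illustrate what separating data flows from physical environments can look like, here's a tiny hypothetical sketch: the same deployed service distinguishes test traffic from live traffic by an attribute the request carries (a made-up header here), not by which environment it happens to be running in.

```typescript
// A sketch of keeping data flows separate from physical environments: one
// deployment, two logical data sets. The header and table names are hypothetical.
type Flow = "live" | "test";

function flowOf(headers: Record<string, string | undefined>): Flow {
  return headers["x-data-flow"] === "test" ? "test" : "live";
}

function tableFor(flow: Flow): string {
  // Same physical environment, different logical data set.
  return flow === "test" ? "orders-test" : "orders-live";
}
```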

There is, of course, a bit of an elephant in the room here in a rather large, unanswered question:

How do we test the full end-to-end product then?

Firstly, do you need to? Does the thing that just got changed need to be tested in the whole, entire chain? If the answer to that is yes then, in my opinion, it hints at wider problems in the system as a whole. Maybe we don't trust that our change won't break something downstream because the overall system is poorly understood, poorly architected or both. Maybe we don't have proper versioning, or don't have a properly specced-out data API yet. Or maybe our releases are far too big. These are things that need to be addressed, and probably quite urgently.

Secondly, why do we need a separate environment to test the whole piece? We’ve already got prod and this is the place where the thing we’ve built will be living its best life. Surely, then, it makes most sense to ensure that all is right there? I don’t want to dive into the murky waters of feature flags and the like, but we can use deployment strategies such as canary releasing or blue/green deployments to help tackle this.
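As a hedged example of the canary idea, here's a sketch using AWS Lambda alias weighted routing to send a slice of production traffic to a new version. The function and alias names are hypothetical; the same approach applies to any platform that supports weighted or blue/green routing.

```typescript
// A sketch of a canary step using Lambda alias weighted routing: keep the
// alias pointing at the current version but send 10% of traffic to the new one.
import { LambdaClient, UpdateAliasCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

export async function startCanary(newVersion: string) {
  await lambda.send(
    new UpdateAliasCommand({
      FunctionName: "playback-entitlements", // hypothetical function name
      Name: "live",                          // the alias production traffic hits
      FunctionVersion: "41",                 // current, known-good version
      RoutingConfig: {
        AdditionalVersionWeights: { [newVersion]: 0.1 }, // 10% canary traffic
      },
    })
  );
  // Watch the canary's metrics, then either promote it to 100% or roll the weight back.
}
```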

All of this leads to our development philosophy and how we want to build the things we own and run. For the above to work, we need to do some fundamental things.

A Conclusion, or The End of the Beginning

For everything to work well we need to be able to release small, iterative changes, often. We need to make sure that we know exactly what is in production FOR OUR SERVICE at all times. We need to really understand the interface points between our service and those we connect with, be they upstream or downstream.

The everlasting cycle of change and renewal

The golden rule that we should aim to follow and never deviate from:

Our trunk branch is what is in prod; What is in prod is our trunk

The only time this is not true is when a merge to trunk happens and the pipelines are deploying that change to production. To reduce the time that we are out of sync, releases should be small and tackle one change, and one change only. I'm sure we've all heard the mantra "release early, release often", and it very much rings true.
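One way of keeping ourselves honest about the golden rule is to check it automatically. The sketch below assumes a made-up convention of a /version endpoint that reports the commit a service was built from; compare that with the head of trunk and you know instantly whether you're in sync.

```typescript
// A sketch of checking "trunk is what is in prod". The /version endpoint and
// its response shape are hypothetical conventions, not anything standard.
import { execSync } from "node:child_process";

export async function trunkMatchesProd(): Promise<boolean> {
  // "main" here stands in for whatever your trunk branch is called.
  const trunkSha = execSync("git rev-parse origin/main").toString().trim();

  const res = await fetch("https://my-service.example/version");
  const { commit } = (await res.json()) as { commit: string };

  return commit === trunkSha;
}
```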

To enable this, and to have confidence in what we're doing, we must build ourselves a suite of automation tests that run as part of our pipeline and give us the levels of confidence we need to be happy for changes to go to production. Remember: we build it, we own it, we run it. Which means if it goes wrong at 3am then we're getting out of bed to fix it. What better motivation to get your automation tests solid?
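What those automation tests look like will vary wildly by service, but as a hedged illustration, here's the shape of a post-deploy smoke test (written for Jest, with entirely hypothetical endpoints) that a pipeline could use to gate a release.

```typescript
// A sketch of a post-deploy smoke test that a pipeline could run before
// declaring a release good. Endpoints and expectations are hypothetical.
describe("playback smoke tests", () => {
  it("answers health checks", async () => {
    const res = await fetch("https://playback.example/health");
    expect(res.status).toBe(200);
  });

  it("returns a playable stream for a known test asset", async () => {
    const res = await fetch("https://playback.example/v1/streams/test-asset-1");
    const body = (await res.json()) as { manifestUrl?: string };
    expect(body.manifestUrl).toBeDefined();
  });
});
```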

Another rule:

Don’t tie yourself to any other service if you can possibly avoid it

Do you really need to hit that static IP to get that piece of data? Are you sure there isn't at least a suitable API gateway you could use? Would it be better to subscribe to a queue of messages rather than polling? Maybe they can push to a message exchange that we can subscribe to? Decouple yourselves as much as possible! Use message-based architectures if you can. Worry about your boundaries and what you are expected to do, and make sure you get that solid and slick.
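For instance, if the other team can publish events, consuming them can be as simple as the sketch below: a Lambda handler fed from an SQS queue subscribed to their topic. The queue wiring and message shape are assumptions for illustration only.

```typescript
// A sketch of consuming events instead of polling someone else's API:
// a Lambda handler fed by an SQS queue subscribed to the other team's topic.
import type { SQSEvent } from "aws-lambda";

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    const message = JSON.parse(record.body) as { matchId: string; scorer: string };
    // Do our thing with the event; the other team never needs to know we exist.
    console.log(`Goal event received for match ${message.matchId} (${message.scorer})`);
  }
};
```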

Also, what happens if that API isn't there for some reason? How do we mitigate that? Can we use a stale cache or some other fallback mechanism? If not, can we fail gracefully and, most importantly, let people know something is amiss? Make failure a routine thing. Build it in. Force other teams to develop exception handling for when your service isn't available.
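A stale-cache fallback can be as simple as the following sketch: remember the last good answer, serve it when the upstream call fails, and make some noise about it. All the names here are hypothetical.

```typescript
// A sketch of a stale-cache fallback: serve the last good answer when the
// upstream call fails, and shout about it. Service names are hypothetical.
let lastGoodSchedule: { fixtures: unknown[]; fetchedAt: number } | null = null;

export async function getSchedule(): Promise<unknown[]> {
  try {
    const res = await fetch("https://schedule.internal.example/v1/schedule");
    if (!res.ok) throw new Error(`schedule-service returned ${res.status}`);
    const fixtures = (await res.json()) as unknown[];
    lastGoodSchedule = { fixtures, fetchedAt: Date.now() };
    return fixtures;
  } catch (err) {
    if (lastGoodSchedule) {
      // Degrade gracefully with stale data, but make sure someone knows.
      console.warn("Serving stale schedule data", {
        ageMs: Date.now() - lastGoodSchedule.fetchedAt,
        err,
      });
      return lastGoodSchedule.fixtures;
    }
    // No fallback available: fail loudly rather than silently.
    throw err;
  }
}
```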

And another:

Measure everything

Really understand your service and what makes it tick. Know when something weird is going on. Forewarned is forearmed, after all. Ensure that your monitoring and alerting evolve with your service. In an ideal world we find an issue before it's actually an issue and fix it before anyone notices. Can we see that the response times from that other service are degrading for some reason? Alert them. Let them know, just in case they haven't seen it yet.
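As a small, hedged example of what "measure everything" can mean in practice, here's a sketch that records the response time of every downstream call as a custom metric (CloudWatch here, but any metrics backend works), so degradation is visible and alertable before it becomes an outage.

```typescript
// A sketch of recording downstream latency as a custom metric. The namespace
// and metric names are hypothetical; CloudWatch is just one possible backend.
import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({});

export async function recordDownstreamLatency(service: string, millis: number) {
  await cloudwatch.send(
    new PutMetricDataCommand({
      Namespace: "MyTeam/Downstream",
      MetricData: [
        {
          MetricName: "ResponseTime",
          Dimensions: [{ Name: "Service", Value: service }],
          Unit: "Milliseconds",
          Value: millis,
        },
      ],
    })
  );
}
```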

And finally, one more for luck:

Embrace failure and learn from mistakes

As clichéd as it sounds, we learn from when things go wrong. DAZN is fantastic at this: there really is a blame-free culture and it's so incredibly empowering. We figure out what went wrong and work out how to make sure it doesn't happen again, or at least is mitigated. There is no blame, just an opportunity to learn and understand.

We are constantly renewing ourselves and trying to be better and more efficient. It is a continuous cycle of learning and improving, as the Ouroboros above depicts.

We’re all human at the end of the day and we make mistakes. I absolutely guarantee you, though, that the stories you tell about work in the future will almost exclusively be the ones about when it all went wrong. No one remembers another successful pipeline release in the middle of February, but they will always remember the time they made the national press for a mistyped command that triggered a flood of unwanted emails.

I think this is the strongest thing I'll take from my time at DAZN: psychological safety in the workplace is paramount. Engineers are people. People do good things and make awesome stuff when they feel safe and happy. Let's keep allowing people to explore tech, try things out, make mistakes and grow. It's a far more entertaining and rewarding journey than following the status quo and repeating the same old mistakes over and over again.
