Enabling LiveOps Across Games with Shared Operational Excellence (DORA)

We draw a line between our teams with each deployment to prod. We can measure where we are to create time and give it back.

Molly Sheets
22 min read · Aug 26, 2022
If a game developer makes it here — they get bought, get more funding, or die.

An executive once asked me why continuous integration and continuous deployment (CI/CD) is important. I said “Time.” Then we sat in silence for 15 seconds to process humanity’s unifier, parallelized across a room of 8 people (120 seconds). For better or for worse — I’m obsessed with it.

I like to create time through workloads. I like to give it back to teams. If I cancel a calendar invite, it’s a gift. If you put it on my calendar, it’s also a gift. I like to understand how early choices and silo-busting culture shape major decisions at companies. If you talk to me about it, you’ll see my eyes get really wide. I spend a lot of time…thinking about time to recursively give it back, slow it, accelerate it, and make things happen at the right moment. I hope to make you excited about spending the rest of your life understanding the time you have today, tomorrow, and the true magic that is making more of it.

Time is the single most valuable resource in the world and hard to recreate, except in this context. Operational excellence, achieved through testing in production, is an adventure of sharing empathy as we gift time to each other.

We start by accepting that we are all limited by this one resource. Our businesses — and the time remaining we have to give to those we love.

This is a long article, but if you care about operational excellence in infrastructure for games, it covers the obstacles to adopting efficiency metrics across teams and leaves you with diagrams for talking about measuring time at massive scale — by understanding how far apart centralized functions often are from those at studios. I hope you invest 22 minutes now to recursively get them back, parallelized across people.

What You’ll Get From Your Time

This article (1) establishes the connection between business value and operational excellence as a result of adopting DORA metrics across a full game portfolio. Measuring these metrics accelerates operations and raises team morale through more reliable infrastructure, a lower likelihood of needing an engineering response during on-call rotations, and quicker fixes through incremental commits and deployments. Simply, measuring how we work makes our infrastructure more reliable and resilient.

This article also (2) provides two intentionally hilarious, but true, diagrams to explain this organizational challenge with humility. The first is “what is a game today” and the second is a “devops shared-responsibility communication model” for those considering adopting DORA, by level-setting where everyone truly is. Use them to drive an empathetic culture shift on what is a challenging conversation for anyone with a significant games portfolio. They show us where we all are in real time, before arguments happen about the tools we need, who broke the build, and server-side alarms.

Holistically, this document proposes splitting the ownership for reporting DORA metrics per workload, shared between teams that are co-dependent, similar to AWS’s COPE model (Cloud Operations and Platform Enablement). That is a mix of cloud philosophies — DORA is a set of measurable metrics, while COPE is a way of being.

“The application teams get assistance setting up environments, CICD pipelines, change management, observability and monitoring, and establishing incident and event management processes with the COPE team integrated with those of the enterprise as required. The COPE team participates with the application teams in the performance of these operations activities, phasing out the COPE team engagement over time as the application teams take ownership.” (Source: AWS Operational Excellence Pillar)

Teams should have shared, measurable goals that make them less co-dependent where dependency slows getting to production, and that speed things up in the right places by providing developer processes or tools that enable ownership.

The more teams understand each other in relation to getting to production faster (testing in production), the more they can make sure they are working toward developer productivity based on a real baseline. Understanding this baseline and where to speed up also encourages partners and external providers (cloud providers, vendors, engines, and platforms), on which they are often co-dependent regardless, to make decisions that accelerate this industry as a whole.

First, Let’s Remove the Fear Tied to Performance

Anytime a company says the words “performance goals,” it instills fear in engineers and product managers, because at some point someone experienced performance goals used in terrible ways: organizations creating an unhealthy reliance on delivery as the measure of performance, instead of treating delivery as the result of good operations. Perhaps those goals were tied to promotions, hiring and firing, and re-orgs, which creates trauma. That’s not this.

An operational excellence goal should do for games the opposite of what traditional performance goals do. Operational excellence metrics should measure giving back time, the same way we look at infrastructure cost optimization to reinvest in new projects — if we save money here, we can use it elsewhere. If we save time here, we can give it back to people for more personal time or learning: whatever your team needs to be human. More importantly, getting to production faster by way of incremental changes decreases risk and works toward the vision of a healthy on-call model, because changes are easier to address and issues happen less often. Decouple operational metrics from how we measure the growth of individuals. Individuals should be measured by how much they champion and mentor each other (which you have to make time for) — not what they deliver. A team and all its co-dependencies is who delivers.

Ask anyone who has released a content monster of a game and they’ll tell you that, in the end, how many features you released matters once you’ve found exactly what your players like and want more of: puzzles, for example. Then you still land at trying to figure out ways to make that faster through a system. You may create tools to build match-3 levels that populate piece positions from a backend, and tools to auto-build levels, turning a designer into someone who judges the quality of a design built by a machine. Having a bunch of levels built in-engine creates huge client-side binaries, which take forever to build (both by a level designer and to compile), forever to download, and a lot of space. That’s not any one person’s fault. It’s simply that hard to keep up with how much people consume content. For all of these reasons, measuring people by the number of features delivered is much less valuable than measuring operational excellence and getting to production, based on deployment frequency, flow, time to recovery, and work interruption, to name a few time-centric metrics, because it leads to innovation that helps you keep up with players as testing cohorts.

The enterprise games business has needed a way to communicate how far apart or close we are, org to org, persona to persona. As an industry we are often empathetic to our technical challenges but struggle with human reflection. We’re ready to discuss devops and reliability more deeply now that we’ve played around with the pain of decoupling features into microservice components, but that pain came at the cost of people transforming into different jobs.

Measuring operational excellence as “getting to prod” across an entire entity creates a well-run ship, but who has done this for games at massive portfolio scale? Portfolios keep getting more massive. We’re still figuring out what is good for the player with regards to what they are forced to adapt to on the client. We’ve spent the last decade slowly moving more code to the server and chunking up delivery to happen in the background, on load, or before the player gets there. The biggest challenge yet for our industry is that, with regards to devops for games, the discussion of shared responsibility between platform engineering teams and game developers is difficult because of how games are built.

I don’t expect an industry shift to valuing operational metrics over delivery metrics to be a short-term effort, but it is in progress and some teams are well on their way — give us 5 years to get to “great.”

Part 1: Time as the Business Metric Unifier

This on-going culture shift requires diplomacy and an acceptance that we are in a philosophical debate that challenges the games industry. Who owns what part of the games workload across an entire company? And what is a “game” now? Have you zoomed out recently, taken a breath, and…thought about it? Yikes.

Diagram A: “What is a game?” Zoom out and then zoom in. This gave me a full-on migraine to make.

In the last 5 years, game developers moved to decouple features from the client and instead serve as much as possible as cloud-native microservices — DLC, add-ons, map updates, changes to match-making algorithms, server upgrades, payments platforms, voice comms, and more. As we segmented games workloads into “bite-size” units of architecture, we discovered we build more. We deliver faster. We enable teams to work in fun groups of subject matter expertise. We outsource what we don’t have the skills to do.

To enable this to continue safely, teams align goals to (1) operational excellence (proactive liveops), not only (2) product delivery (reactive feature development). Eventually tools teams find the harmony between both by getting to (3) innovation (proactive feature planning). But you can’t skip directly there or you end up in burnout.

How do we balance those three states? What if a team is in a negative state (reactive liveops)? DevOps for games?…it’s complicated. It requires your entire company to adopt the same mental model for measuring time, culturally, starting with a platform or centralized tools team as sponsors of collaboration. They must facilitate a discussion of who owns what measurement between development teams and the platform team on a per-workload (microservice/feature) basis, not only per game. THAT is much harder than deploying anything, because developers have varied backgrounds, game genres, and existing tooling.

But it’s a challenge worth running at. If we care about operational metrics, we accelerate completion of delivery metrics as a byproduct and retain people. And if we find ways to make stampable workloads or features while aligning entire entities on similar cloud-native mantras (instead of everyone running on their own architectural philosophies), you land at the games business’s state of enlightenment: the liveops dream state.

Diagram B: “Publisher LiveOps Dream State.” Games in prod, some in dev, sunsets to make way for new experiences all integrated with teams that help us move faster. But with that comes agendas!

Enter DORA for Games

In 2018, Google acquired DORA (DevOps Research and Assessment). If you are not familiar with the DORA metrics you can read them here, but in summary they are Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service. I’ve copied their definitions below from the source, followed by a small sketch of how a team might compute them.

  • Deployment Frequency: “How often an organization successfully releases to production”
  • Lead Time for Changes: “The amount of time it takes a commit to get into production”
  • Change Failure Rate: “The percentage of deployments causing a failure in production”
  • Time to Restore Service: “How long it takes an organization to recover from a failure in production”
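To make that concrete, here is a minimal sketch (mine, not Google’s or any vendor’s tooling) of how a team might compute the four metrics from its own deployment history. The record shape and field names are hypothetical; substitute whatever your pipeline already emits.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class Deployment:
    committed_at: datetime                    # when the change was committed
    deployed_at: datetime                     # when it reached production
    failed: bool = False                      # did it cause a failure in production?
    restored_at: Optional[datetime] = None    # when service was restored, if it failed

def dora_summary(deployments: List[Deployment], window_days: int = 30) -> dict:
    """Summarize the four DORA metrics over one workload's production deployments."""
    if not deployments:
        return {}
    lead_times = sorted(d.deployed_at - d.committed_at for d in deployments)
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time_hours": lead_times[len(lead_times) // 2] / timedelta(hours=1),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": (
            sum(restores, timedelta()) / len(restores) / timedelta(hours=1)
            if restores else 0.0
        ),
    }
```

The interesting part isn’t the arithmetic; it’s agreeing on what counts as a “deployment,” a “failure,” and “restored” for each workload, which is exactly the conversation the rest of this article is about.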

Sounds interesting if you are on a platform team, are a site reliability engineer, or work in devops, right? What about if you are not? We’re cynical and know frameworks are often not made for us as an industry. We have frameworks “applied to us.” When we apply DORA philosophy developed by vendors, what is realistic? We run at the “test in prod” mantra because it definitely increases team morale. Incremental changes and smaller updates mean that when things do break during on-call, it’s much easier to push a fix and understand the problem. It literally lets you sleep at night to push more often, faster, with smaller changes if you’re in a liveops role. It creates cultures of psychological safety. That resonates. Hard. This all makes sense. And I want it. Badly. For everyone.

But you and I, who both probably work in games if you’re reading this, both saw the chart above about “what is a game” and the many parts there are to actually get into production. I’ll be honest. I think we’ve been underestimated.

The Challenge of Rolling Out DORA in Games

If you take Google’s DevOps survey to find out how you stack up against others in your industry, you get the same thing every game developer groans about when asked to input data into a survey form field: Forms often include games as part of Media & Entertainment.

Not a specific knock against Google — many vendors and third-parties are guilty of this.

Gaming is a $336B industry according to Bitkraft (Source: Bitkraft VC) when including streaming services, communication, engines, IP/content sales, and hardware. Newzoo in ’21 put it at $175B, where it was already larger than TV, film, and music (collectively, Media & Entertainment) combined, before those adjacent categories were even included (Source: Newzoo). Vendors, please give our industry the justice of having our own industry benchmarks for this achievement.

We have to consider DORA metrics and who owns them against:

  • Client builds, per platform (Mobile, PC, Console, AR/VR), per environment. “Prod” is a loaded term because, even within these environments, it can mean different sets of player personas that changes affect.
  • The entire game lifecycle — a team may want press- or influencer-only environments at a given time. The actual definition of what “prod” is changes as a game gets built out.
  • Seasonal or weekly content updates in an interactive ecosystem where players give feedback and the teams work as fast as possible to incorporate that feedback.
  • Whether a gameplay feature or microservice should be tested with different groups of players as part of “testing in prod” and whether it is more server-side code or client-side code or both.
  • Operational infrastructure upgrades, which include security updates, keeping stacks on the latest versions of the infrastructure they run on, logging and monitoring tools, and alerts and alarms for performance.
  • Backend functionality and multiplayer gameplay networking features that are authoritative to the backend, bundled and rolled out as separate microservices. This includes, but is not limited to, changes to matchmaking, map configurations, and client-server communication that ideally isn’t out of band from client updates.
  • Open-source and vendor dependencies that change as a game stays live.
  • Hosting all the other business-critical workloads game developers need (websites, analytics, self-hosted internal operational tools).

I’m tired. Are you tired? Get a coffee because we’re about to hit part two. As a recap — we now know what the 4 DORA metrics are and that they lead to business survival and winning by way of measuring operational excellence and developer productivity, but that no one really made any games-specific material or cared that much about our problems when they were written.

Part 2: A Shared-Responsibility Communication Model for Applying DORA to Real Workloads — The “4 Square of Less Arguments”

We now understand measuring time by using DORA is how we could work towards better team health and accelerate development. We need a way to split the responsibility for those measurements between centralized platform, analytics, and game teams. This is one of those nightmarish challenges I like to call “running at fire,” and it gives me a reason to wake up in the morning.

Pretend you’re a backend programmer, and I sit on a cloudops or platform tools team and just gave you an infrastructure-as-code file, an SDK, or a product.

  • Did I provide adequate documentation based on your background for your team to deploy it on your own?
  • What if it needs changes to fit your needs? What if you write great networking code and are comfortable managing infrastructure, but had a better idea for your team’s needs?
  • What if you need more compute for launch timelines — when?
  • What if you aren’t using the same infrastructure vendor and we just met?

I’m not trying to scare us as an industry. What I hope to do is remove the fear of running at these problems, which can cause communication friction when who you are is far apart from who someone else is, based on your experiences, knowledge, goals, and where you both have been. In games we all have at least one shared goal — make great games that are stable for players and keep our jobs while doing it.

On a call many moons ago with wildly abstracted business goals, a lead manager and I had a shocking realization: ten minutes into the conversation, we discovered we were using different definitions of the phrase “step function.” He meant it in the strategic sense, and I meant AWS Step Functions. Our teams were moving at completely different speeds in different directions by that point.

Because this is so hard, to help teams communicate around DORA in games I’ve thrown together a shared-responsibility communication model (sigh, judge me for all the 2x2 matrices and quadrants you’ve grown up with — I am sorry). It is designed to help us talk about those DORA metrics in a way that resonates with people who love radical candor and hate debating without empathy. I don’t have a proper name for it — I’m excited by the hope that future teams may want to blow it up because it makes me look like a corporate asshat: so for now let’s call this idea the “4 Square of Less Arguments.”

Diagram C: “Devops Shared-Responsibility Communication Model: The 4 Square of Less Arguments.”

For any given workload, picture your team and the person you are about to talk to as a dot. Place them based on your understanding of where they are in building “the game” (knowledge, experience, goals, technologies, what aspect of “the game” they engage with the most every day), and then place yourself as a dot based on your own context within this 4 square of less arguments. Draw a line between where you both are right now.

How far apart are you in what you know about each other’s needs? Somewhere on that line sits ownership of the 4 DORA metrics for the workload you’re discussing, and both of you are going to move on that line. The closer you two get (which may not be in the center), clearly defining who owns which DORA metric (but also SLOs and alarms), play/runbook responsibility, and the developer-productivity customer experience — the faster things get deployed, the less they break, the better the tools and decisions you adopt, the better it is for players, and the better it is for your own mental health. Spend time understanding this first before you build or drive adoption for anything at all together.

It’s very likely that, at different points in your career in games, you’ve lived in all the quadrants, with different-sized dots. And that’s okay. It’s also okay if you did something off the grid (the cloud is still data centers). People build tools with skills and tags to surface this stuff and no one looks at them. They look at faces on calls. So what are you showing to talk about the mission you are working towards together, to surface where everyone has been all these years with regards to what we’re all adopting and building together? Try it! I genuinely hope it helps you see where everyone is to draw the productivity line.

A. CloudOps — Tools, Services, & Platforms Teams Integrations

In my adventures it’s very clear to me that if you are at a big entity these days, you probably have a cloudops team. Everyone has different names for this team, but loosely, their goal is to enable productivity while serving both internal customers and external players. This is true both in games and non-games, and many adopt frameworks that turn them into product-centered teams with customer-facing product managers, producers/project managers, or both. It’s very easy to fall into the trap of “we will build whatever the customer asks for” because it’s an exciting and new way to operate.

But remember the communication model above? Customers are very likely asking for things they don’t need OR need something different. This is largely because we didn’t, especially in our games journeys, all spend time in all 4 quadrants. In fact, if you worked on games with particularly long development timelines for PC or console, your experience is going to be vastly different from someone who worked on other platforms, at an agency, or at an indie developer before they somehow got to today. Someone with a background at a smaller studio may have used a backend-as-a-service provider for hosting rather than AWS, GCP, or Azure, for example, before they worked elsewhere.

Cloudops teams aim to be a unifier with a less-is-more approach. They look to build tools that solve problems across games and can create standards in networking templates, infrastructure templates, security, observability, and deployment pipelines. They are also the best place to put your centralized finops, because games are notorious for both over- and under-provisioning, and a historical record of that helps everyone do it better. You also want finops here because centralized cost monitoring sees spend across teams, which is how you get discounts on everything.

Any tool or template deployed from cloudops should be managed with DORA metrics. The question is: is the cloudops team managing the templates, updates to the SDKs, and third-party tool integrations, and what are they alarming on? Are they responsible for the infrastructure monitoring itself, application troubleshooting, and handling alerts/alarms? What if an application is causing a CPU alarm to go off and the internal customer should be capable of handling the fix without cloudops? When an alarm goes off — who answers? Co-dependent, shared paging: it’s a doozy live. When we need to make a change to something for a SEV 1, those DORA metrics sure do matter. Discussing them ahead of time surfaces where the real shared-responsibility challenges are going to be, and helps solve problems that are really contested lines of ownership to address, not technical bugs to mitigate.
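One low-tech way to surface those lines of ownership before a SEV 1 is to write them down per workload, next to the infrastructure itself. Here is a hypothetical sketch of what that could look like; the structure, team names, and URL are invented for illustration, not a standard.

```python
# Hypothetical per-workload ownership map: who reports which DORA metric,
# who answers which alarm, and where the runbook lives.
WORKLOAD_OWNERSHIP = {
    "matchmaking-service": {
        "dora_metrics": {
            "deployment_frequency": "game-team",
            "lead_time_for_changes": "game-team",
            "change_failure_rate": "shared",          # reviewed jointly in ops reviews
            "time_to_restore_service": "cloudops",
        },
        "alarms": {
            "cpu_utilization": "game-team",           # application-level fix, no cloudops needed
            "cluster_node_health": "cloudops",
            "deploy_pipeline_failure": "shared",
        },
        "runbook": "https://wiki.example.com/runbooks/matchmaking",  # placeholder URL
    },
}

def who_answers(workload: str, alarm: str) -> str:
    """Answer 'who gets paged?' before the pager actually goes off."""
    return WORKLOAD_OWNERSHIP[workload]["alarms"].get(alarm, "undefined: go have the conversation")
```

If a lookup comes back “undefined,” that is a line of ownership you haven’t drawn yet.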

B. Client Code: Build Farms, Chunkers, & Patches Delivered at the Edge: Building, Testing, & Bundling Binaries

Build farms are a clear use case understood by games people for operational metrics because build times bother them in an xkcd kind of way. Guy Kisel at Riot Games released my favorite detailed build farm dive in May ’21, where Riot covered their entire Perforce-Jenkins stack for Legends of Runeterra all the way to their Slack notifications. They even built a custom Git GUI because they realized that, end to end, you have to care about the whole development team (artists, game designers, etc.), not only those writing code, to increase developer productivity, improve operational excellence, and give back time to everyone. Literally, “Everyone commits.” This opened my mind up to the distance between all of us.

Build farms become more complicated as a development team grows, because the test suites required to cover all the features and the number of platform targets grow with it. They are characterized by min/maxing build times against quality and test checks, which typically results in less frequent releases to production as “distribution builds.”

To land on a good deployment frequency to production, tools teams build patchers that deliver binaries in chunks (the chunker calculates the binary delta between the last version a player was on and the new release), releasing based on diffs and reusing what they can from previous patches so the end user doesn’t have to download as much. However, deployment frequency as a DORA metric in this context also heavily depends on what players are willing to put up with in terms of “downloading a new patch.”
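As a rough illustration of the idea (not Riot’s implementation or any shipping patcher, which would use smarter content-defined chunking), a chunker can be sketched as hashing chunks of the old and new builds and only shipping the chunks whose hashes changed:

```python
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB fixed-size chunks; real patchers tune this heavily

def chunk_hashes(build: bytes) -> list[str]:
    """Split a build into chunks and hash each one."""
    return [
        hashlib.sha256(build[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(build), CHUNK_SIZE)
    ]

def patch_plan(installed_build: bytes, new_build: bytes) -> dict:
    """How much of the new build can be reused from what the player already has?"""
    installed = set(chunk_hashes(installed_build))
    new = chunk_hashes(new_build)
    reused = sum(1 for h in new if h in installed)
    return {
        "total_chunks": len(new),
        "reused_chunks": reused,
        "download_chunks": len(new) - reused,
        "reuse_ratio": reused / len(new) if new else 1.0,
    }
```

The reuse ratio is the knob that decides whether a higher deployment frequency is something players will actually tolerate.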

For example, Riot’s League of Legends in 2019 shared over 85% of its bundles with its previous version. Immediately we have what looks like an overlap with cloudops (I don’t know what they call their version of cloudops), because serving bundles to reduce the number of patch requests often happens via an edge CDN. Using Legends of Runeterra and the League patcher as an example, the shared responsibility for DORA would unite the build farm and cloudops at the stage when any binary or patch is ready to hit production.

Taking Riot’s use case, every deployment process, for each major environment, should track all DORA metrics in order to understand deployment frequency, lead time for changes, change failure rate, and time to restore for incremental builds, full builds, and also the patcher/chunker — and this should be reviewed by both teams. One game (Legends of Runeterra), two teams, two workloads, where one workload is a shared chunker that also serves another, larger game, League of Legends, as an internal customer.

But build farms, which primarily focus on builds for client distribution to production on target platforms, are not the only consideration for manifesting time with regards to operational excellence and “getting shit done.”

C. Server CI/CD: Deployments for Microservices and Serverless Infrastructure and Client-Server Musings

Many infrastructure teams have investigated GitOps with ArgoCD to deploy server-side templates to production, since they use Kubernetes for everything from multiplayer game hosting to build farms. I think this is fantastic, but it’s not the only solution out there; it’s simply the loudest, thanks to the Cloud Native Computing Foundation. If you do use GitOps and have something like this, it’s a great place to start gathering benchmarks and wrapping a mission around DORA metrics as a practice for server-side deployments.

I’ve also seen a variety of trends around giving developers more ownership of the backend templates, instead of cloudops, by providing them in languages with which they are already familiar (think AWS CDK) to configure and modify. The more ownership moves to the development team from the infrastructure or cloudops teams, the more guardrails get put in place for security. From that evolved the concept of policy as code. Policy authoring and Open Policy Agent are their own robust problem space to live in (which users can access specific resources and which clusters a workload can deploy to, for example). Conversations between a cloudops team and a developer get deep around exactly who owns what part of a workload every time a feature is deployed. These days features look like microservices — either serverless offshoots or microservices hosted in containers.
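For a flavor of what “templates in a language developers already know” looks like, here is a minimal AWS CDK (Python) sketch of a single feature deployed as its own small workload. The stack, handler path, and names are invented for the example, and a real cloudops team would layer guardrails (permission boundaries, policy-as-code checks) on top.

```python
from aws_cdk import App, Duration, Stack
from aws_cdk import aws_apigateway as apigw
from aws_cdk import aws_lambda as _lambda
from constructs import Construct

class LeaderboardFeatureStack(Stack):
    """A hypothetical game feature owned end to end by the game team."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        handler = _lambda.Function(
            self, "LeaderboardHandler",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="index.handler",
            code=_lambda.Code.from_asset("lambda/leaderboard"),  # placeholder asset path
            timeout=Duration.seconds(10),
        )

        # One small, independently deployable surface area keeps changes incremental.
        apigw.LambdaRestApi(self, "LeaderboardApi", handler=handler)

app = App()
LeaderboardFeatureStack(app, "LeaderboardFeature")
app.synth()
```

Because the template is just code the game team already reads, the ownership conversation can move from “can you deploy this for us?” to “which alarms and DORA metrics do you keep, and which do we?”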

Server-side CI/CD and CI/CD on the client use entirely separate workflows and tooling. While they may both use git, understanding whether every portion of “the game” is in sync is an exercise. If a feature isn’t co-dependent on another portion of the team it’s not a problem, but if it is? Awesome. Server CI/CD and client-side distributions don’t use the same tools to “get to prod” at all. There is a different model for managing server-side CI/CD, and few outside of games teams themselves truly understand how to unify that with people who are living in client-side land. It feels like: “Just give them the AWS SDKs or a Terraform template and call it done, and if they can’t handle it they should use Beamable!” I call this space “the void,” but in the diagram it is also known as the “hilarious ‘prod is out of sync’ danger zone” that you have to design around. Recruiters call it “I’m looking for a full stack eng.”

If we discuss client-side integration for API calls to the backend: it’s a mess for anyone not using a third-party backend-as-a-service provider. Enter custom SDKs, which teams didn’t write because they love DIY tooling. They wrote them because they had to. While mobile improved, consoles, PCs, and the ecosystem around them need specific auth, so game developers are forced to use first-party distribution platform workflows and then make all of this comfortable in the client in a repeatable, standardized pattern across multiple games — or become a game distributor to challenge the status quo. Or both. Meanwhile, developers cannot wait for engine vendors to solve cloud integration in a meaningful way at enterprise scale. Today, it’s still often faster, and more secure, for them to write their own tools for liveops innovation on top of clouds that plug into all the things, so they can provide that en masse to their family of games rather than wait on a vendor.

I’m excited to see what happens with integrations driven around better client-server communication and testing backends by exposing abstracted configuration to developers. It’s very hard given all the features different games need, and there is so much room for operational excellence at that specific, abstracted overlap of integration: increasing overall deployment frequency for a feature or for a multiplayer or stateless game, decreasing lead time for changes, and lowering the change failure rate.

D. Analytics Teams & Monetization Integrations

There is a lot to discuss with analytics and monetization, but speaking generally, once a studio has more than 3 games, it starts looking beyond third-party analytics and monetization solutions because owning them has more business value. The moment they make that choice, they absorb the responsibility for operational excellence for a lot of workloads and workflows. They move into the world of data warehouses and data lakes to build their own solutions. Teams in this stage rapidly adopt the cloud because analyzing their own player analytics is pivotal to their business. They dive into the world of data ingest, data storage, analysis, and visualization.

I won’t cover all the tools, architectures, and patterns, as that’s outside the scope of this article, but one thing I will cover with regards to DORA and team collaboration is that teams have to manage development clusters and production clusters for data warehouses, and the constant growth of data within them. That leads them to come up with new transformation patterns (ELT vs. ETL) that fit their workloads’ use cases as they manage a growing data lake. Eventually, they start to understand what needs to be queryable in the warehouse by all the personas who need to know revenue, purchasing, gameplay analytics, and more (30 days of data? 6 years of data? Who has access to what, and where do we draw the lines for security?) versus what is better as cold storage.
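As a tiny illustration of that queryable-versus-cold decision, here is a sketch of a retention policy routing records by table and age; the table names and retention windows are invented for the example, not a recommendation.

```python
from datetime import datetime, timedelta
from typing import Optional

# Invented retention windows: how long each kind of data stays hot (queryable in the warehouse).
HOT_RETENTION = {
    "revenue": timedelta(days=365 * 6),      # finance wants years of history
    "purchases": timedelta(days=365),
    "gameplay_events": timedelta(days=30),   # high volume, short analytical shelf life
}

def storage_tier(table: str, event_time: datetime, now: Optional[datetime] = None) -> str:
    """Decide whether a record belongs in the warehouse or in cold object storage."""
    now = now or datetime.utcnow()
    retention = HOT_RETENTION.get(table, timedelta(days=90))  # default for unknown tables
    return "warehouse" if now - event_time <= retention else "cold_storage"
```

Changing a policy like this is exactly the kind of deployment where change failure rate and time to restore matter far more than deployment frequency.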

Game studios drive data access and decisions around both historical data and materialized (“duplicate”) views of live player databases. Deploying changes to these types of workloads, whether maintenance changes or feature requests, is a huge effort, because the business impact is profound if the warehouse is managing data for multiple games. It’s quite different to make changes to analytics workloads and all their parts versus changes to one game. Teams end up in a state of “getting all the data together” for all the games only to figure out ways to decouple it — there is simply so much of it, and so many people trying to answer questions, that segmentation by use case comes into play.

Because data gravity means access and query patterns drive more than just getting to production, one could argue deployment frequency for the clusters themselves is not nearly as important a metric as change failure rate and time to restore service. More importantly, when changes are deployed is critical, since heavier workload patterns mean certain processes only occur at specific times of day, of the month, or of the quarter to pull reports.

So while client distribution builds for games are often what slows down innovation and progress for server-side-driven features, by being out of sync when they are too co-dependent (which is what liveops aims to solve), the business itself is what slows down deployment of analytics features to production, through the risk and madness introduced by data gravity.

EOF :wq!

I wrote this as a person who has survived the games industry and is still trying — you probably are one too. It has evolved so much. While writing this I couldn’t stop and had to ^C out of it. Truthfully, I’ve only written where to start with DORA for games, depending on where anyone sits and the conversations they are going to have tomorrow. It is going to be an absolute mess that I enjoy being in. I’m sure I’ll surface benchmarks at a later date on a blog somewhere, after making sure I legally can and after experimenting alongside people smarter than I am. If you have a DORA-like benchmark you are already working towards for a workload, let me know — I’d love to see it.

Operational excellence is an adventure of sharing empathy to manifest time. When we engage people, what we are really engaging, is their time spent in the past, the challenges of the present, and the trust we’ll make good choices.

--

Molly Sheets

Director of Engineering, Kubernetes @Zynga | Former Principal SA, Enterprise Games & Principal PMT, Spatial @AWS | 25 Releases | 15 yrs in tech | ❤s CloudOps