Evolution of Live Operations — Iteratively building Tools and Infrastructure.

Shimul
Pocket Gems Tech Blog
6 min read · Aug 1, 2019

Building a game that players love is incredibly hard. Running and evolving that game so that players still love it after 3–4 years is even more challenging. You need to be able to iterate fast without sacrificing the stability of the system. Bugs can be the death of a game!

This post will go over various challenges we faced and how we mitigated them by continually iterating on our tools.

Pocket Gems’ War Dragons Live-Ops Team is principally responsible for building and running events in the game. To understand how the tools work, let’s go over the underlying tech. Our events don’t run natively on iOS/Android; they use a hybrid architecture. So, unlike the usual app release cycle, we can push our event code to players as downloadable content.

To iterate fast, we kept version 1 of our tool elementary.

  1. We had predefined servers for testing, where engineers would deploy, as needed, for QA to test.
  2. PMs (Product Managers) would upload the data in JSON format to the test server and associate it with the event version. Since most of the data is alike from one event to the next, they had to copy the data over manually from past events and make adjustments as needed.

[Figure: Event Tools V1]

This system was agile and helped us quickly develop several events, iterate, and understand our player base. However, as we grew and moved faster, we started running into problems with our tools.

  • Multiple teams started pushing to our codebase without the knowledge of the Live-Ops engineers. As a result, the code QA tested wasn’t the same as the code going live in production.
  • The JSON data being copied and edited for each event grew to an unsustainable size. This led to many manual errors, ranging in severity from a missing image to the whole game crashing.
  • Every time there was a fire, we needed to compensate players, which always meant extra work.

These problems not only added unnecessary work for the team and hurt morale, but also damaged the image of the company and the game in the eyes of our players.

Improving engineering workflow

We needed to make sure that QA could test on the latest code, while engineers could do their work without worrying about which version to deploy manually.

Luckily, we already had an in-house Jenkins for our application build system. We created new configurations in it to build a continuous deployment system for three of our staging servers. Now, any time someone pushed changes to master, they were automatically deployed to all staging servers, so QA was always testing the newest changes. [How we incorporated development branches for testing will be explained in a later blog post.]

Okay, we now had the latest code deployed on the staging servers, but how would QA and PMs test all the different events without engineering intervention? To solve that, we built a tool that allowed anyone to map an event to a server version.

Now, anyone can map specific events they want to test in the staging server, with the latest code. It’s hard to have a perfect cross-team communication system, but it’s possible to overcome that by building out the correct set of tooling.
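As a rough illustration of the mapping idea, here is a minimal Python sketch. The class names, the server names, and the shape of the mapping (event name to event data version, per staging server) are assumptions made for this example; the actual tool's data model is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class StagingServer:
    """A staging server that always runs the latest master build."""
    name: str
    # Hypothetical shape: event name -> event data version mapped for testing.
    event_versions: dict = field(default_factory=dict)


class EventMapper:
    """Lets anyone (QA, PM, engineer) choose which events a server runs."""

    def __init__(self, server_names):
        self.servers = {name: StagingServer(name) for name in server_names}

    def map_event(self, server_name, event_name, version):
        if server_name not in self.servers:
            raise KeyError(f"unknown staging server: {server_name}")
        self.servers[server_name].event_versions[event_name] = version

    def active_events(self, server_name):
        # Return a copy so callers cannot mutate the registry by accident.
        return dict(self.servers[server_name].event_versions)


# Usage: three staging servers, as in the post; names are made up.
mapper = EventMapper(["staging-1", "staging-2", "staging-3"])
mapper.map_event("staging-1", "fall_event", "v12")
print(mapper.active_events("staging-1"))
```

The point of the design is that choosing what to test becomes a data change anyone can make, instead of a deployment an engineer has to perform.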

Improving Product Workflow

The challenge here was to:

  1. Have a data format that is human-readable, easy to extract and transform programmatically, and reviewable in GitHub.
  2. Have a way to quickly navigate old event data for reference and act on the data.

Our product managers were already working with CSV regularly, so it made sense for us to use the same file format here. However, some parameter values contain data nested multiple levels deep, which would make the CSV look like a JSON blob all over again.

We finally decided to use a flattened key-value structure for our CSV file format. To keep the files manageable, we split the parameters into logical chunks and created separate files accordingly.

{"teamProgression": [{"player_numbers": 10, "target_points": 20000}, {"player_numbers": 10, "target_points": 140000}]}

The JSON data above is now represented as:
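The exact column layout we use is not shown here. As a minimal Python sketch of the flattening idea, assuming dotted key paths with list indices (a hypothetical convention chosen for this example, not necessarily the one used in War Dragons), the transformation could look like this:

```python
import csv
import io


def flatten(value, prefix=""):
    """Yield (key_path, leaf_value) pairs for nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}.{index}" if prefix else str(index))
    else:
        yield prefix, value


def to_csv(data):
    """Render nested data as a two-column key,value CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["key", "value"])
    writer.writerows(flatten(data))
    return buffer.getvalue()


event = {
    "teamProgression": [
        {"player_numbers": 10, "target_points": 20000},
        {"player_numbers": 10, "target_points": 140000},
    ]
}
print(to_csv(event))
# key,value
# teamProgression.0.player_numbers,10
# teamProgression.0.target_points,20000
# teamProgression.1.player_numbers,10
# teamProgression.1.target_points,140000
```

One row per leaf value keeps line-based diffs in code review small: changing a single target shows up as a single changed line.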

Once we migrated our existing event data to this file format, we built several other tools on top of it to reduce manual errors and improve the team’s quality of life (QOL).

  1. Create an event based on a previous event.
  2. Build out a system to generate a diff between events.
  3. Build an event-parameter review system.
  4. Use Jenkins to create and release events and notify the team automatically.
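The first two items are easy to picture once events are flat key-value data. The following is a minimal sketch, not our actual implementation: it assumes each event is already loaded as a dict of key path to value, and the function names and sample keys are made up for illustration.

```python
def clone_event(previous, overrides):
    """Start a new event from a previous one, changing only the overrides."""
    event = dict(previous)
    event.update(overrides)
    return event


def diff_events(old, new):
    """Compare two flattened events (dicts of key path -> value)."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


last_event = {
    "teamProgression.0.target_points": "20000",
    "teamProgression.1.target_points": "140000",
}
next_event = clone_event(last_event, {"teamProgression.1.target_points": "150000"})
print(diff_events(last_event, next_event))
```

A reviewer then only needs to look at the `changed`, `added`, and `removed` entries instead of re-reading the whole event.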

Improving Operations

We reduced our fires significantly, but things can always go wrong in a live product. We often compensate players based on the severity of a fire. We also distribute gifts to our players as a sign of gratitude on select days. Every time we needed to gift or compensate players, an engineer had to write a map-reduce function, QA and a PM had to verify that it worked as intended, and we then had to do a dry run in production before pressing the go-live button. I personally dislike map-reduce because of the operational risk it poses: changing a massive amount of data without a backup is risky.

Hello, PG Gifting Tool! We created a tool that accepts currency, consumables, and other in-game items as input. It can grant a static amount to all players or a variable amount per player (defined in an uploaded CSV of players and amounts). Once a second person approves the request, the tool internally runs a map-reduce function to distribute the item(s). It reduces the risks we previously had by building the necessary precautions into the workflow, and it cut the work needed from engineering, PM, and QA for each job.
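To make the request-and-approval flow concrete, here is a small Python sketch. It only models validating a gift request and resolving per-player amounts; the map-reduce distribution itself is left out. The class, field names, and CSV column names (`player_id`, `amount`) are assumptions for this example rather than the tool's real interface.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class GiftRequest:
    item: str
    requested_by: str
    static_amount: Optional[int] = None   # same amount for every player
    per_player_csv: Optional[str] = None  # hypothetical "player_id,amount" rows
    approved_by: Optional[str] = None

    def approve(self, reviewer):
        # The post requires approval by a second person.
        if reviewer == self.requested_by:
            raise PermissionError("approval must come from a second person")
        self.approved_by = reviewer

    def amounts(self, all_player_ids):
        """Return {player_id: amount}; refuses to run without approval."""
        if self.approved_by is None:
            raise PermissionError("request has not been approved")
        if (self.static_amount is None) == (self.per_player_csv is None):
            raise ValueError("provide exactly one of static_amount or per_player_csv")
        if self.static_amount is not None:
            return {pid: self.static_amount for pid in all_player_ids}
        rows = csv.DictReader(io.StringIO(self.per_player_csv))
        return {row["player_id"]: int(row["amount"]) for row in rows}


request = GiftRequest(
    item="rubies",
    requested_by="pm_alice",
    per_player_csv="player_id,amount\np1,5\np2,10\n",
)
request.approve("qa_bob")
print(request.amounts(all_player_ids=[]))
```

Putting the approval check inside the same object that produces the amounts means the distribution step cannot be reached with an unreviewed request.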

Overall Impact:

  1. From 2+ fires per week due to manual error, we are down to one or fewer such fires per quarter (24+ fires per quarter down to 1).
  2. We used to need one full-time PM to handle event releases, plus a couple of days of engineering work per week. Now all we need is a couple of hours per week from a PM, and only when parameters need to change.
  3. Our PX team can run campaigns and distribute prizes without any engineering effort.

We developed the tools either as part of team hackathons or during the downtime between major projects. We decided whether a tool was worth building based on the following questions:

  1. Will it improve our work efficiency? (The time saved through improved efficiency over ~6 months had to exceed the effort to build the tool. The equation may vary based on the scope of the tool.)
  2. Will it improve our system stability noticeably? (This is a subjective question, and you should consider answering it along with other product priorities you have.)
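The first question is simple arithmetic. As a sketch, taking ~6 months as 26 weeks and using made-up numbers (neither the function name nor the figures come from our actual planning):

```python
def worth_building(hours_saved_per_week, build_hours, horizon_weeks=26):
    """True if time saved over the horizon (~6 months) exceeds build effort."""
    return hours_saved_per_week * horizon_weeks > build_hours


# Saving 4 hours a week pays back an 80-hour tool: 4 * 26 = 104 > 80.
print(worth_building(hours_saved_per_week=4, build_hours=80))  # True
# Saving 1 hour a week does not: 1 * 26 = 26 < 80.
print(worth_building(hours_saved_per_week=1, build_hours=80))  # False
```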

As you can see, continually iterating upon our tools has helped us improve our stability and reduce our workload at the same time — a win for both our players and our Gemmers!

Join our team! Pocket Gems is hiring!

Shimul
Pocket Gems Tech Blog

Director of Engineering - Pocket Gems; Expert in building teams; Problem Solver; Successfully co-failed a startup.