Maintenance: Building on a stable foundation

Pradyut Panda
Pocket Gems Tech Blog
7 min read · Oct 22, 2021

The story so far…

Move fast and break things, they say … but then who fixes it all later, for the long run?

Once you get beyond a year or so, the foundations on which an app runs start to develop some cracks. After 7 years, the Episode codebase was no exception, and naturally along the way we had accumulated a bunch of technical (tech) debt. Fixing this tech debt would allow us to iterate quickly and spend less time fighting the codebase when building out new features.

This problem was compounded by the fact that we had no dedicated maintenance/tools team, and so no formal way to work on tech debt and tools. There was also no culture on the engineering side of proactively addressing tech debt with small refactors along the way. At the same time, we had to hit all the required items on time to keep Episode in the Google and Apple app stores. These are usually changes handed down by Apple and Google that must be done if you want to release a client on their stores (for example, supporting a new OS or device).

Prework

Before formally kicking off the process, we took a step back and tried to understand what our needs were and how to make a strong case for maintenance! So how did we decide how much time to dedicate to maintenance engineering work? We inserted Engineering into the annual planning process and wrote up a plan/strategy that incorporated the following components:

  • Required work, which we estimated based on the past 1–2 years of experience. What is the work that comes up repeatedly? Software Development Kit (SDK) updates, support for the latest devices/OS versions, etc.
  • Large projects we planned to undertake, such as:
    - Upgrades to our 2D animation library, Spine
    - Upgrades to our Web Previewer tool, a JavaScript (JS)-based authoring tool
    - Re-architecture of our server code that delivers stories to the Episode app
    - Building Tools such as Autobot — our automation testing framework
  • An additional allocation of ~10% for “everything else,” based on prior experience and on the 10–15% figure often cited in blog posts and conversations as a reasonable minimum ongoing investment in maintenance.

Based on this, our recommendation was to spend around 20–25% of engineering time on maintenance.

The Initial Maintenance framework

Given the above situation, we decided to formally propose a framework for the maintenance problem. Here are the things we did to kick off this effort:

  • Create a maintenance track for the studio to handle all of the above, including Required Maintenance, Tech Debt, Tools, and Polish. This lets us prioritize different types of maintenance at different times.
  • Commit to allocating around 20–25% of engineering time toward this work studio-wide, per the recommendation above.
  • Set up a backlog in JIRA to track this work and measure how we are doing against our goals. This backlog also feeds our quarterly reports (see the sketch after this list).
  • Create a cross-discipline team that meets once a Sprint to prioritize and groom the backlog. This team ensures that specific people are responsible for getting maintenance work done.
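
To make the JIRA piece concrete, here is a minimal Python sketch of how a quarterly-report tally could be pulled from such a backlog. It uses the open-source jira library; the server URL, credentials, project key, and label are hypothetical stand-ins, not our actual setup.

```
from jira import JIRA

# Hypothetical JIRA server, credentials, project key, and label.
jira = JIRA(server="https://example.atlassian.net",
            basic_auth=("bot@example.com", "api-token"))

backlog_jql = "project = MAINT AND labels = maintenance"
total = jira.search_issues(backlog_jql, maxResults=False)
done = jira.search_issues(backlog_jql + " AND statusCategory = Done",
                          maxResults=False)
print(f"Maintenance backlog: {len(done)}/{len(total)} items done")
```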

Once we had this in place, we ran it like so:

  • Set up a Maintenance team to groom the backlog every Sprint. That team would then send out the top items on this list to the Engineering Pods to pick up.
  • Have the Engineering pods pick from this list based on their other commitments and confirm what items they were picking up.
  • Create and send out quarterly reports at the midpoint and end of each quarter to track how we are doing. This ensures that the process is still working and that our highest-priority items are being addressed.

Initial Results

So how did things go? Overall, things went well, and we had good progress to show against the backlog and our OKRs (Objectives & Key Results, the overall studio priorities).

The good

We were able to prioritize maintenance work and provide good studio-wide visibility into the work being done in this area. The cross-functional team was able to groom and prioritize the maintenance backlog and was excited about the work being done! Finding a product partner who understands the importance of this work, and the role it plays in having a running app for another 7 years, is super important. Luckily, our head of Product Management understands this need and played an active role in the initial stages of launching this set of work.

The not so good

Sometimes Pods had too many other pressing commitments and could not commit any time to maintenance issues in a given Sprint (or Sprints). We were also not able to make as much progress on internal tools work as we would have liked.

Try and try again

At Pocket Gems we believe in iterating and improving, so we tried a few more things to make a good thing better! The next things we tried were:

Adding larger/strategically important items to the quarterly OKRs for the studio. Each quarter, the different disciplines were polled for their top wish-list items, and based on ROI (Return on Investment), some would get rolled into the OKRs and assigned Pod owners.

In retrospect, all the cross-functional teams (including Product Management) recognized the value of taking the time to address things that pose significant problems for the product or the team. Adding items to the OKRs and explicitly assigning them to teams at the beginning of the quarter gives them a much higher probability of success.

Result: This worked out well, since it allowed us to make sure we had some high-visibility items to point to at the end of the quarter. It also ensures that there is clear ownership of the maintenance work.

Creating a full-time FireFighter (FF) role on the engineering side and having this person pick up some of the smaller maintenance tasks from a curated backlog whenever the FF load was light. The FireFighter is a dedicated engineer who looks at any emergent issues.

Result: This did not work out so well; the engineers tended to prefer returning to the Pod work that the FF rotation had interrupted. This is understandable: people want to get stuff done and not leave it hanging.

Dedicating a full Sprint at the end of the year to maintenance work. This coincides nicely with the period when Apple shuts down App Store submissions, so no updates to the app could be shipped anyway. We also used this time to give the engineering team agency in picking the tasks that most interested (or bugged) them from the maintenance backlog.

Result: Not as much work was done as expected, due to a combination of some engineers being out of office (OoO) and others working on tasks with tight deadlines. This would be good to try again during regular working days as part of a regular Sprint.

The Current Maintenance framework

To recap, here is our current framework:

  • Allocate around 20–25% of engineering time toward maintenance work studio-wide
  • Track all work on the JIRA board
  • Slot larger/strategically important items into the OKRs as part of quarterly planning
  • The maintenance team grooms the backlog every Sprint, and sends out the top items on this list to the Engineering Pods to pick up.
  • Engineering pods then pick from this list based on their other commitments and confirm what items they are picking up.
  • Quarterly reports at the midpoint and end of a quarter.

Closing

Overall, this initiative has been going well, and we have met many of the goals we set out to achieve.

Some of the high level wins from this process include:

  1. We were able to complete all our required items from Apple/Google/etc. on time! We kept on top of this much better than in the past. These items are now planned out and scheduled, instead of arriving as ad-hoc requests at the last moment.
  2. We were able to decrease asset downloads by around 40% by switching the data formats of existing assets. Based on some initial research, we realized we could save major download bandwidth by switching formats (for example, from PNG to WebP) for most textures used in the game (a sketch of the conversion follows this list).
  3. We completed a major cleanup effort on the server that led to significant $$ savings. This involved auditing all data stored in GCP Datastore and coming up with reasonable retention (keep-alive) windows for the largest offenders. We were able to delete a couple of entity kinds altogether too (a sketch of the approach follows this list)!
  4. We created a more comprehensive way of backing up game-critical data on the server for disaster recovery. This involved identifying our critical entities in Datastore and backing them up to Google Cloud Storage on a rolling basis (a sketch follows this list).
  5. We were able to make numerous small improvements to internal tools, such as our internal A/B test tool.
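
For the format switch in item 2, a minimal sketch of the batch conversion using Pillow might look like the following; the directory layout and quality setting are assumptions, not our actual asset pipeline:

```
from pathlib import Path

from PIL import Image  # Pillow, built with WebP support

def convert_textures(src_dir: str, quality: int = 80) -> None:
    """Convert every PNG under src_dir to WebP and report the savings."""
    png_bytes = webp_bytes = 0
    for png in Path(src_dir).rglob("*.png"):
        webp = png.with_suffix(".webp")
        Image.open(png).save(webp, "WEBP", quality=quality)
        png_bytes += png.stat().st_size
        webp_bytes += webp.stat().st_size
    if png_bytes:
        print(f"Download size reduction: {1 - webp_bytes / png_bytes:.0%}")

convert_textures("assets/textures")  # hypothetical asset directory
```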
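
For the Datastore cleanup in item 3, the audit-and-expire idea boils down to deleting entities that have aged past their retention window. Here is a hedged sketch; the kind name, the "updated" property, and the 90-day window are illustrative, not our real schema:

```
from datetime import datetime, timedelta, timezone

from google.cloud import datastore

client = datastore.Client()
cutoff = datetime.now(timezone.utc) - timedelta(days=90)  # assumed retention window

# Keys-only query: we only need keys to delete, which is cheap to fetch.
query = client.query(kind="SessionLog")  # hypothetical "largest offender" kind
query.add_filter("updated", "<", cutoff)  # assumed last-modified property
query.keys_only()

while True:
    batch = [entity.key for entity in query.fetch(limit=500)]
    if not batch:
        break
    client.delete_multi(batch)  # Datastore caps one mutation batch at 500
```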
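
And for the rolling backups in item 4, one way to implement them is with Datastore's managed export to a Cloud Storage bucket, run on a schedule (for example, from a cron job). The project, bucket, and kind names below are hypothetical:

```
from datetime import datetime, timezone

from google.cloud import datastore_admin_v1

client = datastore_admin_v1.DatastoreAdminClient()
stamp = datetime.now(timezone.utc).strftime("%Y%m%d")

# Project, bucket, and kind names are illustrative, not Episode's real ones.
operation = client.export_entities(
    request={
        "project_id": "my-game-project",
        "output_url_prefix": f"gs://my-backup-bucket/datastore/{stamp}",
        "entity_filter": {"kinds": ["Player", "Story"]},  # assumed critical kinds
    }
)
operation.result()  # block until the managed export completes
```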

There is still room for improvement, however. For example, we would still like to do a lot more work on the internal tools that the development team uses day to day.

In summary, this is something that is critical for more mature software that hopes to be around for a number of years. As an Engineering Manager you have to keep pushing for this — keep up the good fight!
