Rewriting in the time of COVID-19

Published by Headspace in Headspace-engineering · 9 min read · Oct 15, 2020

Author: Monica Grandy — Senior Android Engineer

How we rebuilt our latest Android app from scratch … remotely

What feels like a thousand years ago (although, in fact, it was October 2019), Headspace made a commitment to expand its offering beyond meditation.

The preliminary prototype was sleek, inviting, feature-rich, and very, very different from what Headspace looked like at the time — especially in terms of technology. As a matter of fact, we were basically looking at a completely different application!

The prospect of building out this new interface and feature set in less than a year was daunting to all of the engineering teams, but for us droids it seemed like a near-impossible task. Why? Because our existing codebase was, to put it bluntly, a hot mess.

If you looked really hard, you might find a couple of features that followed an MVP design pattern, but more likely you would encounter a bunch of fragile, monolithic spaghetti code. What does fragile mean in this context? I could make a simple change in one class and, all of a sudden, a bug would appear in a completely “unrelated” part of the app. “Wait, isn’t that what tests are for?” you ask. It is! Did we have good test coverage? No!

Because so much of our existing code was tightly coupled, it had always been extremely difficult to refactor and address tech debt without completely putting the brakes on feature development. This fragility was reflected in a high volume of bug reports and crashes. Our team was exhausted from building features on an unstable codebase while simultaneously juggling a seemingly endless stream of bug tickets.

There was, however, an upside to all of this: we saw an opportunity for a clean slate with this new app design and made our case to senior management. After a whirlwind of meetings, timeline discussions, pushback, and more timeline discussions, we had the green light to code-freeze our existing app (except for fixing high-severity bugs and crashes) and write our new one from scratch.

Along the way, we laughed, we cried (well, at least I did … at least once), and we were sent home from the office and forced to manage work/life balance in a whole new way (not to mention less stable Internet connections). We made a lot of mistakes, but we must have done a few things right because we managed to deliver the new app to our users on time. Here are some things we learned in the process:

What we did well

Upfront planning

In order to build a stable and scalable foundation for the new Android app, we knew that we wanted a clearly defined architecture pattern, high unit test coverage and an easily navigable codebase. However, we really needed to put some thought into finding the right solution for each of those goals. Before writing a single line of code, we spent a week documenting and discussing our app architecture, package structure, and the libraries and frameworks we wanted to include. I led this initiative along with our other Android team lead as a way to facilitate discussion on what our ideal codebase would look like, and give everyone on the team a sense of ownership over the planning process. Each engineer was responsible for a topic we wanted to discuss and did the preliminary research. By the end of the week, we had made several important design decisions that would set the course of the following 6 months:

MVVM

We went with MVVM for a few reasons. Without a dependency on the view, unit tests were much easier to write, allowing us to set a high bar for coverage. Additionally, ViewModel survives configuration changes such as screen rotation, and LiveData is lifecycle-aware, so data is not lost when the view is recreated. In MVP, we had to manage the lifecycle of our presenters ourselves, which required a lot more maintenance (and led to a lot more bugs!).
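To illustrate the testability point, here is a minimal sketch in plain Kotlin. It is not Headspace's actual code: Observable is a hypothetical stand-in for LiveData, and SessionRepository, SessionListState, and SessionListViewModel are made-up names. A real Android ViewModel would extend the androidx ViewModel class, which this sketch leaves out so that it runs without Android dependencies.

```kotlin
// Hypothetical stand-in for LiveData: holds a value and notifies observers.
class Observable<T>(initial: T) {
    private val observers = mutableListOf<(T) -> Unit>()

    var value: T = initial
        set(newValue) {
            field = newValue
            observers.forEach { it(newValue) }
        }

    fun observe(observer: (T) -> Unit) {
        observers += observer
        observer(value) // deliver the current value immediately on subscription
    }
}

// Hypothetical data source; in a unit test it is replaced by a fake.
interface SessionRepository {
    fun loadSessionTitles(): List<String>
}

// UI state exposed to the view.
sealed class SessionListState {
    object Loading : SessionListState()
    data class Loaded(val titles: List<String>) : SessionListState()
    data class Failed(val message: String) : SessionListState()
}

// The ViewModel knows nothing about Activities, Fragments, or XML layouts.
class SessionListViewModel(private val repository: SessionRepository) {
    val state = Observable<SessionListState>(SessionListState.Loading)

    fun refresh() {
        state.value = try {
            SessionListState.Loaded(repository.loadSessionTitles())
        } catch (e: Exception) {
            SessionListState.Failed(e.message ?: "unknown error")
        }
    }
}
```

Because the class depends only on an interface, a test can pass in a fake repository, call refresh(), and assert on the emitted states without starting an emulator or inflating a layout.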

Databinding

Databinding was something we went back and forth on, as it felt less familiar (adding logic to the XML — what?!) and we were primarily worried about our layouts becoming difficult to read and maintain as well as losing the ability to easily debug. The convenience offered by databinding and the appeal of writing less code in the view eventually won out, and we also learned that we could still debug our code by adding breakpoints into the generated classes created by databinding.

Navigation component

This was a pretty easy sell, since no one enjoyed writing their own Fragment transactions! We also use the SafeArgs plugin to easily pass data between destinations. This has really cut down on the boilerplate code we’ve needed to write, while still allowing us to test our navigation logic.

100% Kotlin

Previously we had a hybrid of Java and Kotlin. Although they are meant to cohabitate in the same project, we had to be hypervigilant when passing data between Java and Kotlin classes, and we found a number of crashes where we accidentally passed null data from a Java class into a Kotlin method or constructor that was expecting a non-null value. While the null safety afforded by Kotlin is great, it definitely works best in an entirely Kotlin project.
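The kind of crash described above can be sketched with a Java API from the standard library. Kotlin treats values returned from Java as platform types, so the compiler does not force a null check on them. The function names below are hypothetical and the fallback value is an arbitrary choice; System.getProperty is a real Java method that returns null for an unknown key.

```kotlin
// Expects a non-null String, like many Kotlin constructors and methods do.
fun greeting(name: String): String = "Hello, $name"

// Declaring the Java result as nullable at the boundary makes the compiler
// insist that callers deal with the missing case.
fun readProperty(key: String): String? = System.getProperty(key)

fun greetingFromProperty(key: String): String {
    val name: String? = readProperty(key)
    // Writing greeting(System.getProperty(key)) instead would also compile,
    // but would fail at runtime whenever the property is not set.
    return greeting(name ?: "friend") // "friend" is a placeholder default for this sketch
}
```

In a mixed codebase, every call into Java needs this kind of deliberate boundary. In an all-Kotlin codebase the compiler enforces it automatically.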

MockK for unit testing

While Mockito is a great library, it was built originally for Java. MockK gives us all the same functionality, plus additional bonuses like the ability to mock extension functions.

Package by feature

Our old Android app was structured so packages were oriented by type (Activities, Fragments, Services, etc.). We wanted to move away from that and take a more modular approach, where each feature is treated as an autonomous package. We still have several classes that multiple feature packages depend on and these have been grouped together in a package called “Common.” While we are not yet building and shipping our new features as independent modules, our code structure now is much more suited to support that in the future.

Isolating the Android team

Our engineering team is broken into cross-functional teams, each focusing on a different OKR. To maximize productivity and minimize distractions, we all left our squads for the first 3 months of the rewrite so we could work more closely together on building a solid foundation for the new application. We had our own JIRA board, 2-week sprint cycles, and a roadmap. By doing this, we were able to cut down on the number of meetings that devs attended and keep everyone focused and on task.

Writing tests!

We knew we wanted a super solid foundation of unit tests, so we added a CI check to our workflow that blocks any PR with less than 80% coverage on new code. Adhering to an MVVM pattern also made it much easier to keep logic out of our views. While we don’t follow TDD, knowing that we are accountable for a high bar of test coverage keeps us from writing code that is tightly coupled and further enforces the Single Responsibility Principle.
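The article does not say which coverage tool or CI system enforced the rule, so the following is only a sketch of the gate's logic under stated assumptions. NewCodeCoverage and passesCoverageGate are hypothetical names, and the per-file line counts are assumed to come from whatever coverage report the CI job produces.

```kotlin
// Hypothetical per-file summary of the executable lines a PR adds.
data class NewCodeCoverage(val file: String, val coveredLines: Int, val totalLines: Int)

// Returns true when at least minPercent of the new lines are covered by tests.
// The default of 80 matches the threshold mentioned in the article.
fun passesCoverageGate(files: List<NewCodeCoverage>, minPercent: Int = 80): Boolean {
    val total = files.map { it.totalLines }.sum()
    val covered = files.map { it.coveredLines }.sum()
    if (total == 0) return true // assumption: a PR with no new executable lines passes
    // Integer arithmetic avoids floating-point rounding right at the threshold.
    return covered * 100 >= total * minPercent
}
```

A CI step would call this with the parsed report and fail the build when it returns false. Measuring only new code means legacy files with low coverage do not block unrelated changes.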

Defining a rollout plan

It’s nearly impossible to rewrite an application without falling into a more waterfall-style development process. Still, we wanted to get something in front of users as quickly as possible and avoid the pitfall of releasing everything all at once. From very early on in our planning process, we decided to have two milestone releases, spaced a couple of weeks apart. Milestone 1 did not really introduce any new features to the user, but under the hood it was built on an entirely new foundation. We wanted to make sure this was stable for our users and patch any severe bugs before rolling out fancy new additions like the video player, so we started this rollout at 20% in only 3 countries with a moderate number of users. Between the Milestone 1 and 2 releases, we split the team: half of us worked only on stabilizing and fixing the existing release, while everyone else focused on feature completion for Milestone 2.

What we could have done better

Getting assets from other teams in advance

While we did a good job of planning out our app design and architecture ahead of time, we were less proactive about anticipating our need for product specs, designs, analytics events, translations, etc. This was especially painful for product specs for pre-existing features that were originally written without any clearly defined acceptance criteria. While some of this could be derived from observing the existing app, defining edge cases, loading, and error states often required lengthy back and forth conversations with product and tech leads.

Scaling up the team

We started the rewrite while we were still actively recruiting two more Android developers and did not anticipate the time investment of conducting a couple of interviews, plus follow-up discussions, every week.

Assigning a dedicated project manager

This role shifted during the rewrite process. For the first couple of months, I acted as project manager to our team of 10, defining our sprints, assigning tasks and tickets, and attending roadmap meetings several times per week to report progress and roadblocks and see how we could better optimize the schedule. This was a lot to juggle on top of my own work as an individual contributor (rebuilding the subscription flow and migrating to version 2.0 of Google’s billing library). Around month 4, much of this responsibility shifted to one of our tech leads, and we got plenty of help setting up specific filters and dashboards to track our progress as our target release date drew closer. Having a full-time project manager turned out to be invaluable to us and is something I wish we had sourced from the beginning of the project.

Developing a QA plan

Our team has been in the process of moving away from manual QA in favor of automation. Although our roadmap for the rewrite included padding for devs to write Espresso tests to add to our automation suite, we knew we would not fill in all the gaps within our time frame, so we decided to compromise and dedicate more time to tests after the release. In the interim, we relied on our unit tests and a couple of overseas QA contractors, who would still perform manual testing for us. It was not until late in the roadmap that we started mandating team-wide bug bashes for all new features and templatizing the documentation around what each bash should cover. Introducing a process change like this more than halfway through the rewrite caused some friction, as everyone had to get up to speed on how to run a successful bug bash and understand the protocol. Overall, these were a net positive, yet we should have set feature bug bashes as a standard from the beginning of the project.

Communicating with the API team

While client and backend developers tend to work closely together at Headspace, this is definitely an area where we fell short. We took all of our existing network calls and re-implemented them in the new project on blind faith. We ended up migrating over a handful of legacy endpoints that returned JSON API (a headache to deserialize compared to plain JSON), creating a lot more work and tech debt for ourselves down the line. We also implemented at least one endpoint that was deprecated soon after. If you think about it (and I try not to), we probably accumulated at least 2 weeks of tech debt work by not having a preliminary conversation with our friends on the API team. Had we done this, we would have had a roadmap as to which endpoints were soon to be deprecated and which should be versioned to return plain JSON.
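For readers unfamiliar with the format, here is a rough sketch of why JSON API responses take more work on the client. It assumes the endpoints followed the JSON:API convention of nesting fields under an attributes object. ResourceObject, Meditation, the "meditations" type string, and the field names are all hypothetical, and the JSON parsing step itself is left out.

```kotlin
// Shape of a JSON API resource after parsing: the fields the app cares about
// live under "attributes" rather than at the top level of the object.
data class ResourceObject(
    val type: String,
    val id: String,
    val attributes: Map<String, Any?>
)

// Hypothetical plain model the app would rather work with. A plain-JSON
// endpoint could be deserialized into this shape directly.
data class Meditation(val id: String, val title: String, val durationMinutes: Int)

// Extra hand-written mapping step that each JSON API endpoint requires.
// Returns null when the resource has the wrong type or a field is missing.
fun ResourceObject.toMeditation(): Meditation? {
    if (type != "meditations") return null
    val title = attributes["title"] as? String ?: return null
    val duration = (attributes["durationMinutes"] as? Number)?.toInt() ?: return null
    return Meditation(id, title, duration)
}
```

Multiply this mapping by every legacy endpoint, add relationships and included resources, and the extra work adds up quickly.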

Where we’re at now

We’ve had the new app rolled out to 100% of users since July, and our devs are fully re-integrated into their squads and charging ahead with new feature development! We took a couple of weeks after the release of Milestone 2 to address the most severe bugs and build out Espresso tests for our most critical flows. While we are still not fully dependent on these automated test runs, we do have them integrated into our CI process and continue to add coverage. We still have a decent amount of tech debt (who doesn’t?), and we have accounted for it in the roadmap for the remainder of the year.

Overall, I am super proud of our team for all the hard work they put in over these past 8 months, and I think everyone is enjoying developing features in a codebase with a clearly defined architecture pattern. With that pattern in place, feature development is faster and much more streamlined. By using the Navigation Component, we have eliminated the crashes related to Fragment transactions that used to haunt us. Because we prioritize unit tests and enforce an 80% coverage rule, our overall confidence in the app is higher and I’m pretty sure we are all sleeping better at night.

From a metrics perspective, we have seen our crash-free rate jump from 99.09% to 99.87%, and our Play Store rating has gone up by 5% in the last quarter. Nothing is ever perfect, but I think we built a pretty great app in a pretty short amount of time and under some pretty unprecedented conditions.

Written by Headspace