QCon 2018: Will Larson on how to successfully run a migration to tackle technical debt

Here are my rough live-blogging notes from the QCon presentation Paying Technical Debt at Scale — Migrations @Stripe, by Will Larson, who leads Stripe’s Foundation Engineering team, which works on data productivity and infrastructure tools. A lot of the talk seems to also exist in blog form here.

As you grow as an engineer and as a company, managing technical debt becomes sort of a constant.

Some definitions:
Tech debt: core constraint on your velocity
Migration: fully replacing a tool, system, product; only way to deal with tech debt at scale
Approach: treat every migration like a product

What is a migration?

Do migrations matter?

As age of codebase increases, productivity decreases.
Code review, continuous integration, linting, typed schemas can help bump up productivity, but then you reach the “trough of sorrows.” To get out of the trough you need to do some migrations.

For most big changes, most people agree that it should be done. But timeline and priority coordination between teams can be really tough.

Other failure modes:
- sequentially doing several migrations
- teams (especially API teams) spending most of their time on migrations instead of new product development
- failed migrations

Examples of failed migrations:
- Hyperbahn project at Uber: failed project
- Digg v4: migrated to Cassandra. Company ran out of funding. (“There’s this saying that ‘Cassandra is a Trojan horse released by Facebook to ruin a generation of startups’”)

If you have strong interfaces between teams, most implementation for a migration might be within team scopes. But if you have weak interfaces, most implementation will be within shared scope.

How to do an effective migration?

Derisk

Place good bets. You can only do so many migrations at once; you can only do so many migrations ever.

Is this worth doing?

  • Find a sponsor: Not exactly an executive sponsor (you can convince them of anything if you try hard enough). Instead, find an engineering team that’s busy and willing to prioritize your proposed work.
  • Opportunity cost: Is this the most valuable thing you can do? Migrations are among the most important strategic bets you make. What you pick will affect the company’s future.
  • Not invented here — avoid this fallacy
  • Hammer looking for a nail — avoid this too!

Will this work?

  • Design documents: The purpose is three-fold. 1) Prove to yourself that your approach is viable. 2) Prove to your customers that this is viable. 3) Go find the detractors and convince them that it will work. A good design doc enumerates not only why the design will work, but also why it wouldn’t work.
  • Prototype: This is about de-risking, finding out whether a solution is possible. Take a couple of hours to verify some crucial part of your approach.
  • Embed with early adopters
  • One easy, then hard: First get something easy to work. Second, work on something hard. The goal is to make sure that finishing is actually possible, instead of realizing far down the line that your solution is not tenable and having to migrate multiple implementations back.
  • Example: Stripe was for a period of time the world’s leading expert on a particular old version of MongoDB. So when they needed to upgrade the MongoDB version, they started with a small dataset. Then they tackled one of their most complex instances (complex sharding, uncommon access patterns, ...). Once that succeeded, they knew they could successfully migrate everything in between.

Enable

User testing: you’re not testing the product (migration), you’re testing the adoption of the product.

  • Interfaces: Get to the user as soon as possible.
  • Documentation: See if someone can read your documentation to do the migration. Is the call to action obvious and reasonable?
  • Operations: Force people to get used to using the system. Chaos engineering and debugging exercises are ways to preemptively inject faults so that people learn how to fix them.

Self-service: make it possible for people to solve their own problems if you can

  • Automate the migration
  • Incremental and reversible (e.g. dark launch)
  • Interfaces. Interfaces. Interfaces. Bad interfaces make migrations hard; good interfaces make migrations easy. Example: MongoDB has bad atomicity guarantees. There are ways for MongoDB to lose data, depending on the configured consistency levels and the primary/secondary structure. A lot of work had to be done to wrap testing around MongoDB in order to ensure a successful migration.
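
As a rough illustration of the "incremental and reversible" point, here is a minimal dark-launch sketch. It is not from the talk: `DarkLaunchReader`, `in_rollout`, and the `percent` knob are hypothetical names, and the old and new systems are stood in for by plain callables. The old system keeps serving every read, the new one is exercised in the shadow for a configurable share of keys, and disagreements are recorded rather than shown to callers.

```python
import hashlib
from typing import Any, Callable, List, Tuple


def in_rollout(key: str, percent: int) -> bool:
    """Deterministically map a key to a bucket in 0..99 and compare to percent."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


class DarkLaunchReader:
    """Serve reads from the old system while shadow-reading from the new one.

    Hypothetical sketch: old_read and new_read are any callables taking a key.
    """

    def __init__(
        self,
        old_read: Callable[[str], Any],
        new_read: Callable[[str], Any],
        percent: int = 0,
    ) -> None:
        self.old_read = old_read
        self.new_read = new_read
        self.percent = percent  # set back to 0 to turn the shadow path off
        self.mismatches: List[Tuple[str, Any, Any]] = []

    def read(self, key: str) -> Any:
        result = self.old_read(key)  # the old system stays the source of truth
        if in_rollout(key, self.percent):
            try:
                shadow = self.new_read(key)
            except Exception as exc:  # a failing new system must not hurt callers
                self.mismatches.append((key, result, repr(exc)))
            else:
                if shadow != result:
                    self.mismatches.append((key, result, shadow))
        return result
```

Because bucketing is derived from a hash of the key, the same keys stay in the rollout as `percent` is raised, and setting it back to 0 reverses the launch without a deploy of callers.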

Finish

How do you fully finish the migration?

  • Stop the bleeding
  • Tracking: build metadata about the project. 1) Tickets (don’t make tickets by hand; build a tool to make the tickets). 2) Reports.
  • Nudges
  • Finish it yourself
  • Good example of finishing a migration: Amy Nguyen (Stripe), Cory Watson (Stripe) on “How to break up with your vendor.”
  • Celebrate when it’s over: build a culture where your engineers are rewarded for their work on migrations.
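
The tracking advice above (generate tickets with a tool, and produce reports) could look something like the following sketch. Everything here is an assumption for illustration: the `Ticket` shape, the service-to-team ownership mapping, and the `docs_ref` string are hypothetical, and actually filing the tickets in a real tracker is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Ticket:
    team: str
    title: str
    body: str


def build_tickets(owners: Dict[str, str], migrated: Set[str], docs_ref: str) -> List[Ticket]:
    """Create one ticket per team listing its services still on the old system."""
    remaining_by_team: Dict[str, List[str]] = {}
    for service, team in sorted(owners.items()):
        if service not in migrated:
            remaining_by_team.setdefault(team, []).append(service)

    tickets = []
    for team, services in sorted(remaining_by_team.items()):
        lines = ["Services still on the old system:"]
        lines += [f"- {name}" for name in services]
        lines.append(f"Migration guide: {docs_ref}")
        title = f"Migrate {len(services)} service(s) to the new system"
        tickets.append(Ticket(team=team, title=title, body="\n".join(lines)))
    return tickets


def progress_report(owners: Dict[str, str], migrated: Set[str]) -> str:
    """Summarize overall progress as a single line for a recurring report."""
    total = len(owners)
    done = sum(1 for service in owners if service in migrated)
    pct = 100 * done // total if total else 100
    return f"{done}/{total} services migrated ({pct}%)"
```

Regenerating tickets and the report from the same metadata keeps them consistent, and the per-team grouping gives a natural target for nudges.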