In the face of demanding roadmaps, use “pay-as-you-go” to invest in tech improvements

Willy Xiao
9 min read · Aug 23, 2023

Have you ever thought:

  • “If I just had an extra month, I would totally crush this tech debt…”
  • “Leadership doesn’t understand the tech issues deeply enough, so they don’t get the value of what I’m proposing…”
  • “We have such aggressive product goals, I never get time to build the system that would solve a lot of tech issues at our company…”
  • “Tech improvements aren’t as exciting as product launches, so PMs won’t let us do it…”

Part of your value-add as a senior / staff engineer is that you have a lot of tech improvement ideas. It’s your job to set technical vision and make architectural improvements — but no one’s giving you space to do it!

In this post, I’ll give you a framework to get alignment, build momentum, and finally make the tech improvements you’ve been wanting to make for a while.

Not only will this scratch an itch; doing it well is critical to improving product velocity. Ultimately, every tech company that scales needs to do this.

Note: In this post, tech improvements can be any of: developer experience, technical debt, new system architecture, migrations. Anything that is not directly tied to product output can count.

Note: this is not a panacea and your mileage may vary. It is just one way to think about making progress on tech improvements.

Use a “pay-as-you-go” framework to make tech improvements

Instead of thinking about the tech improvement as one big investment, break your work down into 3 stages.

  • Stage 1: Initial Alignment & Setup. Align key stakeholders on your technical vision, and do the initial setup work that enables everyone to follow it.
  • Stage 2: “Pay as you go”. Make incremental progress towards your technical vision over time.
  • [optional] Stage 3: “Last Mile”. Sometimes, you need one final push to finish off the legacy system.

You can see a detailed discussion of each below, but first:

Why this works

The root of the problem

When you are unaligned on priority with a decision-maker, sometimes it is because you don’t share the same long-term vision. But, more commonly, it is because you disagree on:

  • “is this work worth it?”
  • “is this work worth it to do right now?”

Decision-makers are investors. They have a limited pool of resources that they can invest in a few items.

Some tenets of investing

Using this analogy, here are some important tenets to keep in mind. Some may be obvious already, but they’re worth noting:

  1. First, the value of making technical improvements is to enable the company to deliver more business value, faster. Every engineering function at a company has some repeatable flow of business value that it needs to produce. For product companies, that usually means creating more user value: producing new features, fixing bugs, improving discoverability, etc. To get anyone who’s anyone to care about your tech improvement, it must be framed in terms of how it produces more of that value. The more precisely and concretely you can describe the business impact of your tech improvement, the better. How is it measured? What is the timeframe of return on investment? Of course, employee morale is important, “best practices” are important, engineering prestige is important, security is important, but they are a means to an end.
  2. Bigger investments are riskier than smaller investments. This is a no-brainer. If you’re wrong about your investment, smaller investments are less painful than bigger investments. If possible, you want the option to claw back the remainder of your investment when you realize that the investment thesis was wrong.
  3. Resources today are worth more than the same resources in the future. If I get the same upside whether I pay the cost today or tomorrow, then I would rather pay the cost tomorrow. In the same way, I don’t want to make big “block” investments today if I don’t have to.
  4. You want to pay for something when you see value from it. In general, it is much easier to ask someone to pay for something when they see value in it than to have them pay today for potential value in the future. In some sense, there’s a goal of “liability matching” here too.

Read below for more explanations, examples, and tips on how to pull this off!

Stage 1 — Initial Alignment & Setup

This stage is where you spend time convincing key stakeholders that your technical vision is the right one and that it is desirable. Sometimes it’s your manager. Sometimes it’s an architecture council. Sometimes it’s code owners on other teams.

You aren’t doing all the work here but are describing the parameters about how much work it is, what the expected outcome is, and what the risks are. You want to clearly describe the vision and the strategy.

In this initial stage, you are also doing the “technical setup” work to enable the rest of the pay-as-you-go model. This might take substantial time on its own too.

Examples

I will use 3 tech improvements at Envoy (I’m an Engineering Manager there) to describe these three stages:

  • Federated GraphQL. As Envoy has moved to micro-services, it successfully migrated to a federated GraphQL setup. Some services at Envoy support over 1,000 QPS through GraphQL, so this was a substantial project. In this stage, the senior engineer leading the project convinced stakeholders it was the right move, set up the initial gateway, and even put up an example of how to produce a service which exposed a subgraph. This took substantial effort, convincing, and vision to pull off.
  • TypeScript. Envoy has one major FE repository called garaje. This initial work involved convincing the core garaje maintainers that TypeScript was useful / desirable / possible. There had already been one attempt at this migration, which failed, so we needed to overcome that inertia. This stage also included setting up the initial tsconfig.json and ts-lint to enable TypeScript to be used in the repository.
  • Ensuring 0 secrets referenced in code. There were some strings in code that could be secret environment variables (some actually were; the vast majority weren’t). The Security Engineer working on this project had to resolve all of the highest priority issues, convince all EMs to prioritize this work, explain the process for rotating secrets if they were real, and set up a commit hook to detect new secrets in all repositories.
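
To make the “commit hook” idea concrete, here is a minimal sketch of the kind of check such a hook could run. It is not the hook Envoy actually uses: the regex heuristic, the function name find_suspect_secrets, and the allowlist of known-public values are all assumptions for illustration.

```python
import re

# Hypothetical heuristic: a secret-looking name assigned a quoted literal
# of 8 or more characters. Real scanners use far richer rules.
SUSPECT = re.compile(
    r"""\b(\w*(?:secret|token|password|api_key)\w*)\s*[:=]\s*["']([^"']{8,})["']""",
    re.IGNORECASE,
)


def find_suspect_secrets(text, allowlist=frozenset()):
    """Return (line_number, variable_name) for each suspicious assignment.

    Values in `allowlist` are skipped, e.g. public-facing UUIDs that only
    look like secrets.
    """
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in SUSPECT.finditer(line):
            name, value = match.group(1), match.group(2)
            if value not in allowlist:
                findings.append((line_no, name))
    return findings


# Made-up sample content standing in for the text being committed.
sample = (
    'API_TOKEN = "sk_live_abcdef123456"\n'
    'PUBLIC_WIDGET_TOKEN = "3f2b8c1e-0000-4000-8000-123456789abc"\n'
    'name = "hello"\n'
)
known_public = frozenset({"3f2b8c1e-0000-4000-8000-123456789abc"})
findings = find_suspect_secrets(sample, known_public)
print(findings)
```

The allowlist matters for adoption: if the hook blocks commits over values that are expected and public, people will learn to bypass it.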

Tips

  • As a reminder, generally your goal should not be to get consensus.
  • If you can get enough alignment with leadership that “this is good”, supporting this cause might even be considered as something engineers can put as part of their performance reviews (“I helped with XYZ migration”).
  • There are many tips on how to create technical strategy, but one of them is that you want to define your Canonical Everything. You are now the champion / leader of this project. Congrats & good luck!

Stage 2 — “Pay as you go”

This stage is where most of the work is done.

In general, the better you did your initial setup, the “easier” it is to use the new system. This stage often takes many weeks, months, or even years. If you did the setup well, you don’t need to “block” time for it as part of roadmap planning.

In the ideal world, you should be reaping the rewards of this work as you go.

Examples

  • Federated GraphQL. There were generally 2 different “domains” that were pay-as-you-go here. The first was services: 2 new services at Envoy immediately routed GraphQL traffic through the federation server instead of each service’s own GraphQL layer. The second was frontends: Envoy had a couple of smaller FE deployments that made good guinea pigs, so they were pointed at the federation server instead of individual services. Over the course of many months, we incrementally moved more services to publish to federation and more FEs to use federation.
  • TypeScript. The expectation is that all new files use TypeScript instead of JavaScript. Each PR touching a vanilla JavaScript file should at least consider converting it to TypeScript (or explain why it doesn’t). Senior TypeScript developers often have to step in to teach folks how to migrate specific, tricky patterns. Generally, we try to start with the “leaves” of the dependency graph. Often in post-mortems, we find an error that typing would have prevented; “convert to TypeScript” is a follow-up to those.
  • Secrets in Code. Each team is responsible for resolving all of their easiest secrets in code first (i.e., those that are detected as secrets but are actually expected, public-facing UUIDs). Teams are asked to create an OKR to tackle a certain number of their secrets. No new secrets should be added to the codebase.
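
Starting with the “leaves” of the dependency graph can be sketched as a topological ordering: convert a file only after everything it imports has been converted. The snippet below is an illustration under assumptions, not garaje’s actual tooling; the file names and the imports mapping are made up, and it uses graphlib from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def leaf_first_order(imports):
    """Order files so each one comes after everything it imports.

    `imports` maps a file to the set of files it imports. Files that
    import nothing (the leaves) come out first.
    """
    return list(TopologicalSorter(imports).static_order())


# Hypothetical import graph for a tiny frontend.
imports = {
    "app.js": {"router.js", "utils/format.js"},
    "router.js": {"utils/format.js"},
    "utils/format.js": set(),
}

order = leaf_first_order(imports)
print(order)
```

If the graph has an import cycle, static_order() raises graphlib.CycleError, which is a useful signal that those files probably need to be converted together.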

Tips

  • If you can point to specific value with every single migration, this is really good for the excitement and value of your overall project. You should be able to “reap the rewards” as you go. If you can’t, maybe there’s a question as to the value of your migration at all.
  • Often, this principle is useful: “All new things should immediately use the new convention.” Any time an existing item is touched, it should be converted to the new convention. You can’t be dogmatic about this and you want to give people a way out, but generally this should be a starting principle so that you’re not introducing new systems in the “deprecated way.”
  • In doing these migrations, you might encounter blockers. As the leader, you will have to help resolve them, modify your strategy, or rethink your plan. In a bad-case scenario, you might even decide to cancel the migration here. Hopefully you thought about this up front and can decide whether you should “go back” if you’re halfway through.
  • If you are deciding to commit proactive time to this migration (not just do it as you go), you should focus on optimizing for cumulative value over time.
  • Post-mortem action items are a great opportunity to garner excitement, to point to impact, and to get time allotted to do migrations.
  • Here, you will need to repeat your vision and mission over and over again.
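
One way to read “optimizing for cumulative value over time” from the tips above: if each migration has a cost and starts paying off only once it is finished, then doing the best value-per-cost items first maximizes the total value collected over a planning window. Here is a rough sketch under simplifying assumptions (one migration at a time, a fixed weekly payoff after completion, everything fits in the window). The names and numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    name: str
    cost_weeks: float      # proactive time needed to finish it
    value_per_week: float  # payoff per week once it is done


def cumulative_value(order, horizon_weeks):
    """Total payoff collected by the end of the horizon for a given order."""
    elapsed = 0.0
    total = 0.0
    for m in order:
        elapsed += m.cost_weeks
        total += m.value_per_week * max(0.0, horizon_weeks - elapsed)
    return total


def best_ratio_first(migrations):
    """Sort by value per unit of cost, highest first."""
    return sorted(migrations, key=lambda m: m.value_per_week / m.cost_weeks, reverse=True)


# Invented backlog for illustration.
backlog = [
    Migration("big rewrite", cost_weeks=4, value_per_week=2),
    Migration("quick win", cost_weeks=1, value_per_week=3),
    Migration("medium cleanup", cost_weeks=2, value_per_week=2),
]

plan = best_ratio_first(backlog)
print([m.name for m in plan])                   # quick win, medium cleanup, big rewrite
print(cumulative_value(plan, horizon_weeks=12))     # 61.0
print(cumulative_value(backlog, horizon_weeks=12))  # 47.0
```

The exact numbers matter less than the shape of the result: small, high-payoff pieces early keep the value flowing while the bigger pieces are still in progress.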

Stage 3 — “Last Mile”

Often when doing tech improvements, you will realize that there is one last little bit that’s particularly tricky to finish off: the “last 1%” that takes 50% of the time on the project. If you’ve decided that it’s “worth it” to finish off, then you might need synchronous / blocking time to do it.

Examples

  • Federated GraphQL. Today, old mobile clients still hit individual services’ GraphQL schemas instead of the federation server. We need to decide if it’s okay to deprecate old mobile clients and totally complete the move to the federated graph.
  • TypeScript. Some files in our FE framework (Ember) cannot be converted to TypeScript. The core maintainers are improving this, but likely these JavaScript files will live forever. That’s okay and it’s not causing us much harm.
  • Secrets in Code. There are a couple of final places (non high-pri security issues) that still need clean-up, so all EMs are asked to prioritize this work for their team in one final push to finish it off this quarter.

And…that’s a wrap!

Often, when I see engineers who have a really clear vision for a technical improvement but can’t get organizational alignment or prioritization to work on it, it is because the work is presented as one big “block” of investment. Even when the impact of that investment is clearly described, leaders are still hesitant to invest in it. By breaking down work in this way, you’ll end up getting more alignment, better momentum for your project, and keep improving your team / org / company’s tech stack as you go. Good luck!

Some more general tips about gradual tech improvements

  • Sometimes, this takes a bit of hustle. You often believe your vision before others do — otherwise it would’ve already been done. You might need to get scrappy to show value; you might even get started on it without alignment as long as no one “strongly disagrees” with the vision. You might need to put in extra time. You might need to find 10% time.
  • Usually, the only parts of these improvements that show up on roadmap planning are the “initial alignment & setup” and the “last mile.”
  • Once trust is built with leadership, bigger “blocks” of investment are easier to make.
  • At bigger companies (e.g. Facebook, Google), this is the only way it works. There’s no way you can pull off a “synchronous” tech improvement for just about anything at large companies…
  • Sometimes, you never finish a migration. That’s okay. Facebook started adding typing to its PHP monolith over ten years ago. Even so, there are still files that remain untyped; most are “no harm, no foul.”
  • If your improvement requires 100% adoption to be worth it (that is, it’s all-or-nothing), it’s pretty hard. Potentially, it is not something your organization would ever invest in, and I would urge you to reflect on its value too.
  • The ability to gradually migrate over time is actually the reason some technologies succeed over others at all. TypeScript, Kotlin, and SwiftUI work in some sense because they allow incremental adoption.
  • This strategy isn’t limited to just coding conventions and styles. You can even do this for new product launches — where you do parts of the system before others.
