Moving Fast Without Breaking Things on a Startup Engineering Team

Jeffrey Silver
Published in Opus
5 min read · Sep 25, 2020

Speed and quality are often framed as zero-sum in the engineering world: an investment in one necessarily comes at a cost to the other. My recent months as Co-founder and VP of Engineering at Opus have convinced me that this is a false choice. We’ve established principles on our engineering team that help us optimize for speed while giving us guard rails that maintain a high bar for quality, with little tolerance for bad user experiences.

Observability Is A Must

Great observability minimizes the time it takes to detect and respond to outages. This reduces the overall risk of changing a system, which gives you the confidence to move quickly. When working in low-observability systems, it’s easy to fixate on everything that could go wrong; good observability lets you focus on the problem that you’re actually trying to solve.

Requirements for “great” observability depend on the system, but a well-instrumented system likely has most of the following: structured logging, profiles, and traces, as well as dashboards for HTTP requests, memory usage, and queue health. Custom metrics should be used to verify that each feature you’ve shipped is working, and those should have dashboards too. After that, it becomes trivial to set up monitors that alert you when anything abnormal is happening. Monitors should be actionable and link to data sources that can help the on-call engineer debug the issue. Have a low tolerance for noise, and tweak monitors to ensure your team isn’t distracted by false positives.
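As a rough illustration of the structured logging mentioned above, here is a minimal sketch using only Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the `fields` key are names made up for this example; they are not our production setup, and a real system would more likely use a logging library or vendor agent that does this for you.

```python
import io
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is a hypothetical convention for this sketch: callers pass
        # extra={"fields": {...}} and the keys become top-level JSON keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    return logger
```

A call such as `build_logger("app").info("session started", extra={"fields": {"user_id": 42}})` then emits a single JSON line that a log pipeline can index by `user_id`, which is what makes per-feature dashboards and monitors cheap to build.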

Figure: Our database-backed queue dashboard
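One common way to keep a monitor from paging on short spikes is to alert only when a metric stays above its threshold for several consecutive samples. The sketch below shows that idea for something like queue depth. The class name, threshold, and window size are assumptions chosen for illustration; in practice this logic usually lives in the monitoring tool's configuration rather than in application code.

```python
from collections import deque


class SustainedThresholdMonitor:
    """Fire only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the newest samples

    def observe(self, value: float) -> bool:
        """Record a sample and return True if the monitor should alert."""
        self.recent.append(value)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(v > self.threshold for v in self.recent)
```

With `SustainedThresholdMonitor(threshold=100, window=3)`, a single sample of 150 between normal readings does not alert, while three readings above 100 in a row do. Widening the window trades detection time for fewer false positives.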

Post Mortems Are A Gift

A diligent post mortem culture is necessary to deliver a high quality experience to users. When done correctly, post mortems give you a chance not only to prevent issues from happening again, but also to avoid never-before-seen classes of issues entirely.

We conduct post mortems if any of the following are true:

  1. The issue affected a core behavior for any of our users.
  2. The issue could have affected users if we didn’t get lucky.
  3. We ran into something that got in our way as we responded to an issue.
  4. The team and/or engineering culture would otherwise benefit from a post mortem.

From there, our process is lightweight: the engineer closest to the issue writes a document as soon as the outage is resolved, and we hold a meeting the next day. In the meeting, we make sure we go deep enough to reach the root cause of the issue. Once we have it, it’s easy to come up with at least three actionable takeaways. We prioritize post mortem takeaways the week after we create them.

Automate Repeatable Tasks

Good tooling is essential to maintaining high velocity. It helps you move through your day-to-day workflow faster, and decreases the amount of frustration a team encounters due to bullshit. The best place to start is your deployment pipeline. It should be fast, reliable, and de-risk the changes you are trying to introduce. Ours is pretty simple: on each branch we run tests, lint, and type-check. Once those pass, you can merge to master, which kicks off a deployment. We deploy to a dev environment, run a system test, and then deploy to production. It generally takes less than 10 minutes to run end to end and rarely flakes.
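The key property of a pipeline like this is that it stops at the first failing stage, so a broken build never reaches production. Here is a minimal sketch of that fail-fast ordering. It flattens the branch checks and the post-merge deployment into one list for brevity, and every name in it (`Stage`, `PipelineFailed`, `run_pipeline`, `example_stages`) is hypothetical; the lambdas stand in for real commands, and a real pipeline would be defined in a CI system's own configuration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage succeeds


class PipelineFailed(Exception):
    """Raised when a stage fails; records what had already completed."""

    def __init__(self, failed_stage: str, completed: List[str]):
        super().__init__(f"stage {failed_stage!r} failed")
        self.failed_stage = failed_stage
        self.completed = completed


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure."""
    completed: List[str] = []
    for stage in stages:
        if not stage.run():
            raise PipelineFailed(stage.name, completed)
        completed.append(stage.name)
    return completed


def example_stages(system_test_passes: bool) -> List[Stage]:
    """Stage order from the text; each callable is a placeholder."""
    return [
        Stage("tests", lambda: True),
        Stage("lint", lambda: True),
        Stage("type-check", lambda: True),
        Stage("deploy-dev", lambda: True),
        Stage("system-test", lambda: system_test_passes),
        Stage("deploy-production", lambda: True),
    ]
```

If the system test against the dev environment fails, `run_pipeline` raises before the production stage runs, which is the de-risking step described above.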

Tooling goes far beyond deployments. We identify tasks that are well defined and repeated, and turn them into buttons in our admin panel (we use Django and rely heavily on the admin panel). We run backfills, content uploads, and support flows through the admin so tasks that could take an hour turn into a few clicks.
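To make the "task becomes a button" idea concrete, here is a framework-free sketch: a small registry that maps a button label to a function operating on the selected rows. In Django, the analogous mechanism is an admin action registered on a `ModelAdmin`. Everything below, including `ADMIN_ACTIONS`, `admin_action`, and the backfill itself, is invented for illustration and is not our actual admin code.

```python
from typing import Callable, Dict, List

# Hypothetical registry: button label -> function applied to selected rows.
ADMIN_ACTIONS: Dict[str, Callable[[List[dict]], str]] = {}


def admin_action(label: str):
    """Register a function as a one-click action under the given label."""

    def register(func: Callable[[List[dict]], str]):
        ADMIN_ACTIONS[label] = func
        return func

    return register


@admin_action("Backfill display names")
def backfill_display_names(rows: List[dict]) -> str:
    """Fill in a missing display name from the local part of the email."""
    updated = 0
    for row in rows:
        if not row.get("display_name"):
            row["display_name"] = row["email"].split("@")[0]
            updated += 1
    # The returned message is what the operator would see after clicking.
    return f"Backfilled {updated} of {len(rows)} rows"
```

The point of the pattern is that the task is written and reviewed once, then run by anyone with admin access, with a clear summary of what changed.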

Bias Toward Simplicity But Know When To Do More

The general rule for startups is to bias toward whatever is simpler and more reversible. This is a great rule of thumb, but there have also been situations where we’ve opted for something less simple because we had a clear understanding of the direction in which we were headed.

This is best seen in our business analytics data pipeline. Our goal was to do some basic dashboarding for key metrics and company OKRs that would update nightly. We were unsure of additional requirements, but it seemed likely that our data pipeline would get more sophisticated as we introduced more data sources and more complex BI needs. We could have thrown something together in Google Data Studio, but we knew it would not give us the flexibility we’d need in the medium term. We ended up building an ETL pipeline with Stitch, Snowflake, and Sigma, with a thin dbt layer that transforms data inside Snowflake. This is more complicated than what we need right now, but we’ve already taken advantage of a lot of features that this design offers. I have a lot of confidence that this pipeline will support our changing needs with minimal additional investment as our company and BI needs grow.

Context Is Key

A key element of our velocity is decision-making speed. When confronted with a technical decision, we first try to understand the surface area and reversibility of the decision. If something has low surface area and is fairly reversible, we generally make decisions in less than an hour. If something has higher surface area or is less reversible, we try to take at most a day or two to get the information we need to move forward. It’s important to us that we make decisions quickly but not hastily.

Context sharing also improves speed by empowering more people on our team to make high-impact decisions. More decision makers means that you can have more people leading more projects without risk of anyone being blocked.

We have a lot of processes in place to help with context sharing: we do a daily stand-up each morning, as well as two blocks each day dubbed “eng hangs”. These are our attempt at mimicking casual water cooler conversations while our team is remote.

We ship features rapidly, and maintain quality levels that delight our users. On top of that, we’re confident that our technical foundation will continue to support Opus as the company grows and our product evolves. But what am I missing? What’s going to break when we add 10 more engineers into the mix? I’d love to hear from you! Get in touch at jeff [at] opus.so

Thanks to Adam Silver, Eli Bernstein, and Per-Andre Stromhaug
