Every technology startup I have been at has had the same problem: how do you build things as quickly as possible in order to achieve product-market fit, and then how do you make sure you can smoothly and rapidly scale once you hit it?
Some of you might arch your eyebrows: typically it’s either/or. You build things fast, or you build things for scale. If you build them fast, you more often than not have to rebuild them when you start to scale. That cost makes obvious sense to incur (without product-market fit, there will be nothing to scale). However, there are ways to reduce it and enable a smooth transition from pre-product-market-fit mode to growth mode. In this post I’ll dig into a few of the strategies I’ve developed on this subject.
1. Embrace the DevOps mindset on day one.
The DevOps mindset is all about giving code ownership to developers from first line to final shipment, while providing support to protect against egregious mistakes. At established companies, the focus is usually on the second half of that statement: don’t screw up. That means DevOps teams spend the majority of their time on heavy automated testing, security reviews, load testing, etc.
At startups, however, the focus is more on the first half of that statement: move as quickly as possible. That means DevOps teams need to focus on things like:
- Tools to let your developers ship code in minutes
- Easy setup of development environments
- Ephemeral testing/staging environments to support parallel development tracks
- Feature flags to decouple deployment from release
These things are easy to dismiss as tasks that a traditional software engineer can work on as they identify bottlenecks. But an engineer is best with headphones in and nothing but the keyboard between them and deploying a new feature; having to stop halfway through to work on a build tool or new deployment method is a reliable momentum-killer.
For this reason, I actually tend to hire a dedicated DevOps engineer very early on in the life of an engineering team. Their lone KPI: cycle time. It’s their job to create the tooling and processes to compress the time from when a ticket is picked up to when it’s shipped to production. Once things start to click on the business side, this person can seamlessly expand focus to responsibilities like scaling, security, and general infrastructure administration tasks.
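To make the feature flag item from the list above concrete, here is a minimal sketch of how a flag can decouple deployment from release. Everything in it is an assumption for illustration: the `FLAGS` table, the flag names, and the `is_enabled` helper are hypothetical, and a real team would more likely load flags from config or a flag service than hard-code them.

```python
import hashlib

# Hypothetical in-memory flag table. In practice this would live in config
# or a flag service so it can change without a deploy.
FLAGS = {
    "new_checkout": {"enabled": True, "rollout_percent": 25},
    "dark_mode": {"enabled": False, "rollout_percent": 100},
}


def is_enabled(flag_name, user_id, flags=FLAGS):
    """Return True if the flag is released to this user."""
    flag = flags.get(flag_name)
    if flag is None or not flag["enabled"]:
        # Unknown or disabled flags are off: the code is deployed, not released.
        return False
    # Hash flag + user so each user lands in a stable bucket from 0 to 99.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < flag["rollout_percent"]


def checkout(user_id):
    """Both code paths ship to production; the flag decides which one runs."""
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"
```

The point of the design is that merging and deploying the new checkout code is no longer the risky moment. Releasing it becomes a change to `rollout_percent`, and rolling it back is the same change in reverse.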
2. Even if you don’t build for scale, measure for scale.
Let’s face the harsh reality: when you get users, things are going to break. And nothing is more humbling as a platform engineer than a customer reporting a major problem that you were not already aware of. Every mature technology product needs the logging and monitoring infrastructure in place to identify errors when or before they affect a customer, and provide the traceability to facilitate a quick fix.
The problem is that shoehorning a reliable logging or monitoring system into existing code is tough. This is especially true of pre-market-fit code, which tends to be poorly architected because the focus is on development speed rather than maintainability. Done right, it requires significant refactoring, and it inevitably still leaves blind spots. It is much easier to embrace this mindset from the get-go: spend the time up front to spin up an ELK stack (or better yet, pay for a service so you don’t have to worry about it and can focus on building), create the wrappers and helper functions that make it trivial for an engineer to fire off a metric or a log line, and start measuring everything. With the proper architecture, you should get logging, latency measurements, and error reporting automatically with any new code that you ship, so when you start getting real users on the platform, you won’t have to scramble around putting out fires that you can’t see.
This also means you should start practicing proper on-call and incident response protocols as soon as possible. Before you hit market fit, incidents will most often come from your sales team having issues in their demos or your infrastructure breaking during testing. Even though the people affected aren’t “customers”, they’re important, and you should treat their incidents as you would a customer’s. Then, once you start to get users, your team will be used to reporting incidents in this way and your engineers will be used to being on call, making it an easy transition.
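As a rough illustration of the wrappers and helper functions I mean, here is a small sketch of a decorator that gives any function logging, a latency measurement, and error reporting for free. The `instrumented` name, the `app.metrics` logger name, and the `op` and `latency_ms` fields are my own assumptions for this example. It uses only Python’s standard `logging` module; shipping the records to an ELK stack or a hosted service would be a matter of attaching the right handler.

```python
import functools
import logging
import time

# Hypothetical logger name; attach whatever handler ships logs to your stack.
logger = logging.getLogger("app.metrics")


def instrumented(func):
    """Wrap func so every call logs its outcome and latency."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception(
                "call failed",
                extra={"op": func.__name__, "latency_ms": elapsed_ms},
            )
            raise  # never swallow the error, only report it
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "call ok",
            extra={"op": func.__name__, "latency_ms": elapsed_ms},
        )
        return result

    return wrapper


@instrumented
def create_invoice(amount_cents):
    """Example business function; it contains no logging code of its own."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return {"amount_cents": amount_cents, "status": "created"}
```

An engineer shipping a new function only has to add one line, `@instrumented`, which is about as low as the barrier to measuring everything can get.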
3. Achieve horizontal scalability nirvana.
My number one rule for platform and infrastructure teams is “don’t get nailed by a reasonably predictable problem”. It is the job of those teams to make sure that your system continues to function smoothly as you grow. Oftentimes, though, preparing for anticipated scale (or worse, reacting to unanticipated scale) requires changes in architecture or underlying technology that take days, weeks or even months to implement. If you’re in a rush, as you inevitably will be once your sales start hitting, I’m sure you don’t want to be up all night setting up database shards.
The way around this is to have a failsafe: if all else fails, make sure you can throw money at a problem and survive the time it takes you to solve it. And this means you need to embrace managed services and horizontal scalability. You have achieved the elusive “horizontal scalability nirvana” when you can literally handle any amount of traffic thrown your way just by increasing the amount of processing power you have deployed. Then, it may be expensive, but you’re not going down. That’s an excellent crutch to lean on.
Tactically, I recommend using as many fully managed “pay as you go” services as you can:
- Host on Heroku so you can scale by hitting the “+” button. Or, if you need more flexibility or security, throw your platform on AWS Fargate. Or, for the ultimate nirvana experience, put everything in a serverless architecture and then go take a nap.
- Queues are your friend. Use them liberally.
- Hosting your own database is a nightmare. Managed SQL databases (e.g. RDS) are better, but you still have a single point of failure and a built-in limit to your processing power. Services like DynamoDB and Datastore are best: you may sacrifice some flexibility and pay more, but you literally do not have to worry about being able to handle your traffic.
You don’t even have to implement any sort of auto-scaling to get the benefits of this approach. Always knowing that you can quickly log into your cloud provider console and bump the number of workers is priceless. And, when that wave of users finally comes in, you’ll rest easy knowing you can handle them while you plan for cutting costs and sharding your database.
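Here is a toy sketch of why queues and stateless workers make “bump the number of workers” possible. It is an in-process stand-in, not a real deployment: Python’s standard `queue.Queue` plays the role of a managed queue, threads play the role of worker instances, and `process` and `run_workers` are hypothetical names I made up for the example.

```python
import queue
import threading
import time


def process(job):
    """Stand-in for I/O-bound work such as calling an API or a database."""
    time.sleep(0.01)
    return job * 2


def run_workers(jobs, num_workers):
    """Drain all jobs with num_workers identical workers.

    Returns (sorted results, elapsed seconds).
    """
    q = queue.Queue()
    for job in jobs:
        q.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        # Workers hold no state of their own, so adding one more is always safe.
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            out = process(job)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results), time.perf_counter() - start


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, elapsed = run_workers(range(64), n)
        print(f"{n:>2} workers: {elapsed:.2f}s")
```

Because the workers share nothing except the queue, the only thing that changes between a quiet day and a traffic spike is the `num_workers` argument. In a hosted setup that argument is the worker count in your provider’s console.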
The above strategies converge on a great motto for tech teams at early stage startups: stay agile, stay fast, and stay confident. You want to be able to move quickly while not worrying about screwing up, and set yourselves up nicely to scale without losing sleep. I’d love to hear about other strategies that have worked in this environment!