The Startup CTO’s Guide to Ops (1 of 3): Guiding Principles
In this series of three posts I’ll discuss the operations setup an early-stage startup needs, and the things that can be punted. This is based on my experiences at several startups I’ve co-founded, as well as at several larger companies.
While I love reading Hacker News posts about the amazing infrastructure at successful companies, I worry that these discussions may encourage an over-emphasis on perfection and scale. Most of us aren’t at the size where we face the same level of technical challenges, and the impact of things going wrong may be negligible. In this series I take a scrappy perspective to investigate: what is a minimum decent starting point?
My target audience is the startup CTO who needs to pick an initial infrastructure that includes: developer setup, deployment system, testing, monitoring, metrics, and production configuration. This discussion may also be of interest to DevOps-minded developers — the kind of folks who believe that these tools and systems are not “someone else’s job” but a shared responsibility.
This post (part 1 of 3) will cover guiding principles and requirements. The second post takes a tour through my operations toolbox. I will end with an example production and deployment setup.
At the highest level, our goal is to stay focused on our business and spend the least amount of money and time possible on our ops infrastructure while ensuring performance, availability, and future scalability. We also care about streamlining the developer experience to improve efficiency and morale.
What I Care About
- This is a business, not fine craftsmanship. Most startups fail for business and product reasons, not because the software was poorly written. We must stay brutally focused on getting the business off the ground, even if we make technical decisions that offend sensibilities of how things ought to be done. To make a carpentry analogy (my other hobby), even an expert takes shortcuts like leaving the back of a dresser unfinished, or makes context-appropriate decisions like not using nice walnut lumber to make a workbench.
- Business metrics are paramount. Customer funnels, features, and new UIs should be instrumented so you can measure what works and what doesn’t.
- Never fly blind, always have monitoring. I have never regretted investing time in tools that automatically check for problems and send proactive alerts. It’s easier and safer to build out complex systems if you can start with building blocks which you can trust, because you’d know if there were any problems.
- Move fast and learn. Any new venture requires a bit of stumbling around in the dark as you learn the realities of your business and customers. An investment in robust operations reduces the penalty of trying stuff out. You can quickly try new ideas, knowing that your tools will quickly tell you if something doesn’t work, and you can always roll back.
Many decisions are reversible, two-way doors. Those decisions can use a light-weight process. For those, so what if you’re wrong?
— Jeff Bezos
Things I Don’t Worry About
- Scaling is not a problem. Even a single machine would be sufficient to get most startups off the ground. As long as you have good monitoring and a plan for how you could scale, don’t spend the time or take on the complexity of building for scale out of the gate. (This article talks about this more, and also about picking the right kind of traffic.)
- A single point of failure is not a #fail. It’s OK to have a single point of failure as long as you know it’s a single point of failure and you can accept the consequences of that thing going down. For example, if you go down at 2pm on a Tuesday for an hour, what is the dollar impact of the few hundred or thousand visitors who cannot access your site? Does this warrant the costs of ensuring zero downtime?
- Zealous testing isn’t a great use of time. You need to allocate limited resources between these complementary functions: tests to prevent bugs; monitoring to diagnose problems and their severity; rollbacks to enable rapid repair; and frequent deployments to decrease risk. Tests can be expensive to develop and maintain, especially when the product and code are in flux. But there’s no easy way to quantify “these tests prevented bugs which would have cost us $x.” Because it’s human nature to over-estimate the risk and cost of possible problems, testing decisions are often based more on fear (and social pressure) than ROI. While there may be a desire to adopt a rigorous testing methodology early on, I encourage teams to start with only a handful of the most critical tests and double down on their monitoring and deployment/rollback systems instead. Over time the high-ROI test areas will become apparent.
- Deployments must be easy, reliable, and not a big deal. A deployment system doesn’t have to be fancy; even a well-documented “recipe” anyone can follow would be better than an ad-hoc scramble.
- Developers must work in an environment as close as possible to production. This usually means working out of a Linux VM running on your local machine.
- The developer environment must be incredibly easy to set up. Ideally you can just run a script that installs a VM, populates it with the necessary tools, and prepares local state.
- Production hosts must use a standardized provisioning setup. There should be a reliable script or well-documented steps for adding a new host or a replacement host. Avoid hand-crafted (“artisanal”) hosts that contain important configurations which aren’t checked in or written down. Almost by definition, your development and production provisioning process should overlap.
- Have a staging environment. You need a prod-like place where you can validate production bugs, test release candidates, abuse the database, integrate remote systems, and make configuration changes.
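To make the single-point-of-failure question above concrete, here is a rough back-of-the-envelope sketch in Python. The function name and every number in it (visitor count, conversion rate, revenue per conversion) are hypothetical placeholders, not figures from any real business; plug in your own. It also ignores softer costs such as lost trust, so treat the result as a floor rather than a full accounting.

```python
def outage_cost(visitors_affected: int,
                conversion_rate: float,
                revenue_per_conversion: float) -> float:
    """Rough revenue lost during an outage.

    Assumes every affected visitor would otherwise have converted at the
    usual rate and that none of them come back later to buy.
    """
    return visitors_affected * conversion_rate * revenue_per_conversion


# Hypothetical inputs: 1,000 visitors locked out for the hour,
# 2% of visitors normally convert, each conversion is worth $50.
cost = outage_cost(visitors_affected=1000,
                   conversion_rate=0.02,
                   revenue_per_conversion=50.0)
print(f"Estimated cost of the outage: ${cost:,.2f}")
```

If the estimate comes out far below what redundancy would cost to build and operate, that is an argument for accepting the single point of failure for now.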
I’m a huge believer in having specific and measurable operational metrics that commit me to what I will do and — just as important — what I won’t do. Any metric, even a rough educated guess or conservative number, is better than none at all. For a typical web startup here’s where I’d start drawing lines in the sand:
- Handle up to 10 requests/sec (not including static assets). For most startups this would represent a terrific high water mark roughly corresponding to what you’d see at the top of Hacker News. Note that Siege is a nice tool for getting an initial sense of your max load.
- Do not guarantee high availability; guarantee a fast response to problems. It’s expensive to build out redundancy from day 1 that truly guarantees high availability. You can’t assume that you’ll be safe just because you’re on Heroku, or EC2, or have multiple machines in a colo, because so many things can go wrong. Your starting architecture should be as simple as possible, and you should appease your paranoia by preparing contingency plans and being ready to react immediately if there are problems. Quantify this response time, for example: “If there is an outage, we should know and respond within 5 minutes.” As the company gets traction, it’s of course appropriate to invest in an architecture that ensures higher availability and reduces ops pain.
- Back up your data (for real). In addition to having a hot-standby replica, safeguard against deleted data or multiple-machine outages by regularly exporting snapshots off-site to e.g. S3. As a critical corollary, be sure to test your database recovery process; nothing is worse than discovering in the middle of a disaster that a standby was misconfigured or that your snapshots were incomplete.
- Track machine health. Keep trended data on machine disk, RAM, and CPU usage, and ensure monitoring will alert you before your disk fills up or a rogue process halts a machine. I usually get nervous when disk is >75% full, or there is sustained load >80% of max, or memory is >80% full. You may wish to add some intelligence to these alerts, for example to not page at 1am about an ops job known to peg the CPU.
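The machine-health thresholds above can be sketched as a small check. This is a minimal illustration, not a replacement for a real monitoring tool: the `HealthSample` type, the `check_health` function, and the `cpu_quiet_hours` window are all names I made up for the example, and it assumes the caller supplies readings that are already averaged over a window (so "sustained" load is the caller's job).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Thresholds from the text: disk >75%, sustained load >80%, memory >80%.
DISK_PCT_MAX = 75.0
LOAD_PCT_MAX = 80.0
MEM_PCT_MAX = 80.0


@dataclass
class HealthSample:
    disk_pct: float  # percent of disk used
    load_pct: float  # sustained load as a percent of the machine's max
    mem_pct: float   # percent of memory used
    hour: int        # local hour of day (0-23) when the sample was taken


def check_health(sample: HealthSample,
                 cpu_quiet_hours: Optional[Tuple[int, int]] = None) -> List[str]:
    """Return one alert message per threshold the sample exceeds.

    cpu_quiet_hours is a hypothetical (start, end) hour window, end
    exclusive, during which a known ops job is expected to peg the CPU.
    Load alerts are suppressed inside it; disk and memory alerts never are.
    """
    alerts = []
    if sample.disk_pct > DISK_PCT_MAX:
        alerts.append(f"disk {sample.disk_pct:.0f}% full (limit {DISK_PCT_MAX:.0f}%)")
    if sample.mem_pct > MEM_PCT_MAX:
        alerts.append(f"memory {sample.mem_pct:.0f}% full (limit {MEM_PCT_MAX:.0f}%)")
    in_quiet_window = (
        cpu_quiet_hours is not None
        and cpu_quiet_hours[0] <= sample.hour < cpu_quiet_hours[1]
    )
    if sample.load_pct > LOAD_PCT_MAX and not in_quiet_window:
        alerts.append(f"load {sample.load_pct:.0f}% of max (limit {LOAD_PCT_MAX:.0f}%)")
    return alerts


# A 1am sample taken while a known ops job runs: the load alert is
# suppressed, but the nearly full disk still raises an alert.
night = HealthSample(disk_pct=80, load_pct=95, mem_pct=50, hour=1)
for message in check_health(night, cpu_quiet_hours=(0, 3)):
    print("ALERT:", message)
```

Keeping the suppression limited to the load check is a deliberate choice in this sketch: a quiet window for a CPU-heavy job should not hide a filling disk.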
Budget Time to Do All This
Regardless of which tools you pick and how much of your infrastructure you outsource, do not underestimate the time it will take to get everything up and running. I’d budget 5+ days to get your core monitoring-metrics-logging suite in order, plus another 5+ days for your deployment system and troubleshooting.
You Need Good DevOps From Day 1
It may seem like a big distraction to build deployment, monitoring, metrics, and other tools. This infrastructure doesn’t seem to add direct value so it’s tempting to relegate these tasks to the future. Indeed, there are plenty of companies who make good money regardless of their messy internals.
While a weak ops infrastructure won’t kill your company, it imposes a pervasive and costly tax:
- Deployments require way too many people, and are so difficult, that you do them rarely. This hurts rapid innovation and introduces the problems of waterfall releases.
- Product decisions are made without clear metrics, which can lead to continued investment in things that don’t work.
- Bits of the production systems can fail without people being aware. You never want to be in a position where your first clue that something died is a customer complaint.
It’s just too expensive to skimp on infrastructure. Moreover, DevOps tools aren’t just about the nuts-and-bolts of keeping a system running: they promote an ethos that you care about your team spending its time effectively and your customers having a great experience. A company that embraces ops from day 1 is planting the seeds for a healthier culture in the long term.
In the next post, I’ll show how you can start with a basic suite of low-cost tools to assemble a surprisingly powerful and solid foundation.