Mid-air Refueling and Optimizing for Scale
You’ve found product-market fit. Customers love your product. New customers are signing up and trying it out. Existing customers are ramping up usage. Everything should be going great, except it isn’t.
Things are breaking, and your technology challenges feel akin to mid-air refueling.
The system you built for your MVP can’t handle the scale you’re supporting today, much less the scale you’ll need to support in the coming months. Request latency is climbing, and Nagios alerts are firing around the clock, including in the middle of the night. Your customers may or may not notice what’s happening under the hood, but the system is definitely straining.
The polish on your MVP is lacking, and bugs are piling up in your support queue. You’re unsure how to prioritize, but you’re squashing issues as fast as you can. Somehow, bugs are still arriving faster than you can fix them, and the backlog continues to compound.
The technical fixes for these problems will, of course, be specific to the system you’ve built. In our previous post, we presented some basic principles for pre-product-market-fit development. Here, we’ll do the same for scaling your post-product-market-fit system.
The shortest path between two points is a straight line. Yet when upgrading a complex system, taking the most direct route forward may not be the safest approach.
Consider the database-powered search you built for your MVP: it’s killing your database, it’s slow, and the result quality sucks. You need to upgrade to a standalone search service like Solr or Elasticsearch.
After reading the Solr 101 documentation, the problem seems trivial at face value. Dig further, though, and challenges arise: your search data spans multiple columns and multiple tables, each of which can be updated at any time and must in turn be updated in your search index. Furthermore, the lexicon of your index includes domain-specific part numbers, which you suspect won’t parse properly with the standard stemmers included in the search engine.
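To see why the stock text analysis is a poor fit here, compare a naive prose-style tokenizer with one that treats part numbers as opaque keywords. This is a toy sketch; the part number `AB-100/XL` is made up, and a real engine like Solr would address this with a keyword tokenizer or a custom analyzer chain rather than code like this.

```python
import re

def naive_tokenize(text):
    """A toy version of what a default prose analyzer does: lowercase
    and split on non-alphanumerics. Fine for English, bad for SKUs."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def keyword_tokenize(text):
    """Treat whitespace-delimited part numbers as opaque keywords."""
    return text.lower().split()

# The default behavior shreds the part number into meaningless fragments:
print(naive_tokenize("replacement filter AB-100/XL"))
# ['replacement', 'filter', 'ab', '100', 'xl']

# Keyword treatment keeps it searchable as an exact term:
print(keyword_tokenize("replacement filter AB-100/XL"))
# ['replacement', 'filter', 'ab-100/xl']
```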
So instead of writing a system that pipes all updates directly into Solr and serves back results with a standard stemmer, you choose a path that looks something like this:
1. Maintain explicit denormalized tables in your database that mirror the schema of your search index, and solve the update problems first.
2. Write a script that verifies these denormalized tables are in fact correct.
3. Switch your database search to run off these denormalized tables.
4. Start writing data into your search index from these denormalized tables.
5. Turn on the search index with no tokenization, and verify that result set sizes match your database search.
6. Tweak your stemming algorithms and iterate until you have something that’s good enough.
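The verification script from step 2 can be sketched in a few lines: recompute what each denormalized row *should* be from the source tables, and flag any row that has drifted. This is a minimal, hypothetical example using SQLite with made-up `products` / `part_numbers` tables; your real schema and database will differ.

```python
import sqlite3

# Hypothetical schema: search data spans two source tables, plus a
# denormalized table that mirrors the shape of the future search index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE part_numbers (product_id INTEGER, part_no TEXT);
    CREATE TABLE search_products (
        product_id INTEGER PRIMARY KEY, name TEXT, part_nos TEXT);

    INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO part_numbers VALUES (1, 'AB-100'), (1, 'AB-101'), (2, 'ZX-9');
    INSERT INTO search_products VALUES
        (1, 'Widget', 'AB-100,AB-101'),
        (2, 'Gadget', 'ZX-9');
""")

def verify_denormalized(conn):
    """Recompute every denormalized row from the source tables and
    return the ids of any rows that have drifted out of sync."""
    expected = {}
    for pid, name in conn.execute("SELECT id, name FROM products"):
        parts = sorted(p for (p,) in conn.execute(
            "SELECT part_no FROM part_numbers WHERE product_id = ?", (pid,)))
        expected[pid] = (name, ",".join(parts))
    actual = {pid: (name, parts) for pid, name, parts in conn.execute(
        "SELECT product_id, name, part_nos FROM search_products")}
    # Taking the union of keys also catches rows missing on either side.
    return sorted(pid for pid in expected.keys() | actual.keys()
                  if expected.get(pid) != actual.get(pid))

print(verify_denormalized(conn))  # an empty list means the tables agree
```

Run on a cron, a script like this turns silent data drift into an alert you can act on before customers notice.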
Without getting into detail, at a high level we’ve broken the solution down around the specifics of the problem at hand. Each step is digestible, and getting to step 6 this way may require twice the work of shortcutting directly there.

Yet now that people are using your product and depending on it, uptime is a big deal, and the additional work is a good tradeoff for the risk it mitigates.
Start measuring everything
You didn’t find your MVP by A/B testing, and you probably had little to no logging when your first demo prototype was completed.
Now that your system is scaling along with your userbase, invest in logging, graphing, and dashboards.
Graphs are important both for identifying change-related regressions and for visualizing historical trends that need to be addressed. Tools like StatsD make it simple to graph metrics without parsing logs, and products like Splunk or Kibana can sit on top of your aggregated logs and provide a simple way to generate charts.
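Part of StatsD’s appeal is how little machinery instrumentation requires: each metric is a one-line UDP datagram. Here’s a minimal sketch of the plain-text StatsD protocol using only the standard library; the host, port, prefix, and metric names are assumptions for illustration, and in practice you’d likely reach for an off-the-shelf client.

```python
import socket

class StatsdClient:
    """Minimal StatsD client speaking the plain-text UDP protocol.
    Host and port are assumptions; point it wherever statsd listens."""

    def __init__(self, host="127.0.0.1", port=8125, prefix="myapp"):
        self.addr = (host, port)
        self.prefix = prefix
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        # UDP is fire-and-forget: instrumentation never blocks the app.
        self.sock.sendto(payload.encode("ascii"), self.addr)
        return payload  # returned so you can inspect what was sent

    def incr(self, metric, count=1):
        # Counter, e.g. "myapp.deploys:1|c"
        return self._send(f"{self.prefix}.{metric}:{count}|c")

    def timing(self, metric, ms):
        # Timer, e.g. "myapp.search.latency:120|ms"
        return self._send(f"{self.prefix}.{metric}:{ms}|ms")

    def gauge(self, metric, value):
        # Gauge, e.g. "myapp.queue.depth:42|g"
        return self._send(f"{self.prefix}.{metric}:{value}|g")
```

With a client this small, there’s no excuse not to sprinkle counters and timers throughout your hot paths.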
Finally, dashboards are the glue that lets your data tell a story. Create a deploy dashboard to watch after each code deploy, measuring average response latency and the like. A search dashboard might measure requests per second, 95th-percentile latency, and execution times for search-related database queries. We’re big fans of the “technology-less” dashboard, which effectively consists of static HTML, embedded graphs, and maybe even an iframe or two ;)
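A “technology-less” dashboard can literally be a generated static page. The sketch below assembles one from a list of graph-image URLs; the Graphite-style render URLs, hostnames, and metric names are hypothetical placeholders for whatever your graphing system exposes.

```python
# Hypothetical graph-image URLs; substitute whatever render endpoint
# your graphing system (Graphite, etc.) provides.
GRAPHS = [
    ("Requests per second",
     "http://graphite.internal/render?target=search.rps"),
    ("95th percentile latency",
     "http://graphite.internal/render?target=search.p95"),
]

def render_dashboard(title, graphs):
    """Build a static HTML dashboard: one <img> per graph,
    auto-refreshing every 60 seconds."""
    rows = "\n".join(
        f'<div><h2>{name}</h2><img src="{url}" alt="{name}"></div>'
        for name, url in graphs
    )
    return f"""<!DOCTYPE html>
<html><head><title>{title}</title>
<meta http-equiv="refresh" content="60"></head>
<body><h1>{title}</h1>
{rows}
</body></html>"""

html = render_dashboard("Search dashboard", GRAPHS)
# Write the page somewhere your web server can reach, e.g.:
# open("search.html", "w").write(html)
```

No app server, no framework, nothing to page you at 3am: the dashboard itself can’t go down in any interesting way.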
Plan for a 12-month shelf life
Now that your product is “working”, it’s tempting to design for the long haul and build that parallelized search cluster that can scale indefinitely. You’ve studied Netflix’s architecture and are convinced you can build something similarly great that supports 300MM concurrent users.
The reality is that while your product may be past product-market fit, the technology is still very much in its infancy. Over the next year and beyond, you’ll attract new customers, and the demands on the system (as well as on the product) will change significantly.
The rule of thumb I generally go by is to expect a 12-month lifespan for most code you write. Given that, invest your effort in ensuring that your code and its associated systems can be rewritten and re-architected a year from now.
Scaling your startup’s technology isn’t easy, and staying as pragmatic as possible is critical. Bleeding-edge technology, complex architectures, and mathematically elegant solutions may be intellectually appealing, but they won’t get you any closer to solving the problem at hand.