Launches are tough, no matter how prepared you are. Launches with a lot of press or a lot of new code are even worse. Although Potluck’s launch had as much press as the Branch.com launch last August, we had a much easier time launching, dealing with bugs, and keeping the site up. Here are a few things that made life easier:
1. We had a plan for how to scale our infrastructure.
We use Heroku for hosting, which meant we could scale our front-ends as necessary on demand. Most things on Heroku are easily scalable (app servers, worker processes), and for everything else we did some back-of-the-napkin calculations and scaled up beforehand.
For our Postgres server, for example, we figured out which Heroku Postgres plan we needed by looking at how much cache we currently used (there’s a great article with more on that here) and extrapolating it out to how many users we expected in the first week of launch. The calculations were very rough, since it’s pretty easy to upgrade and downgrade plans with most cloud services these days. Upgrading before launch, though, meant that this was one thing we didn’t have to worry about during launch day.
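The extrapolation itself is nothing fancy. As a rough sketch of the kind of back-of-the-napkin math involved (all of the numbers below are hypothetical placeholders, not our real figures — cache usage and user counts would come from your own metrics):

```ruby
# Back-of-the-napkin Postgres cache sizing. All numbers here are
# hypothetical placeholders, not our real figures.
current_users    = 10_000   # users today
current_cache_mb = 1_700    # hot-data working set today, in MB
expected_users   = 50_000   # users expected in the first week of launch

# Assume cache needs grow roughly linearly with users, plus some headroom.
headroom        = 1.5
needed_cache_mb = current_cache_mb * (expected_users.to_f / current_users) * headroom

puts "Need roughly #{(needed_cache_mb / 1024.0).round(1)} GB of cache"
# Then pick the smallest Postgres plan whose cache exceeds that number.
```

Since plans are easy to change later, the point is only to get within an order of magnitude before launch day, not to be precise.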
2. Features had simple flow-paths.
One mistake we made with Branch was to make features overly complicated. This made features difficult to build, difficult to test, and difficult for our users to understand.
Our signup path this time was a simple, linear flow. This not only made it easier to code and write tests for, but also meant we didn’t have to click-test 5-10 different cases whenever we pushed any changes.
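A linear flow can be modeled as an ordered list of steps where each step has exactly one successor, so there is only one path to test end-to-end. A minimal sketch (the step names are illustrative, not our actual signup path):

```ruby
# A linear signup flow: each step has exactly one successor, so there
# are no branching cases to click-test. Step names are illustrative.
SIGNUP_STEPS = [:enter_email, :choose_username, :follow_friends, :done].freeze

def next_step(current)
  i = SIGNUP_STEPS.index(current)
  raise ArgumentError, "unknown step #{current}" unless i
  SIGNUP_STEPS[i + 1] # nil once the flow is complete
end

next_step(:enter_email)    # => :choose_username
next_step(:follow_friends) # => :done
```

The moment a step can lead to more than one place, the number of cases to test multiplies, which is exactly what we were trying to avoid.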
3. We had ample time to QA and bug-fix.
This sounds like a no-brainer, but we’ve had many a launch where we were adding features and making cosmetic changes to the site well into launch day. They all seem important at the time, but having a code-freeze several days before a launch is helpful for releasing a polished product and keeping morale high going into launch. With the Potluck launch we had five days between our code-freeze and launch where no new feature code was allowed, giving us almost a week to fix bugs. A day before launch, we had a bug-fix freeze, where even bugs were left unfixed unless they were mission-critical.
These five QA days were also helpful for finishing up some last-minute todos. When you’re in a crunch, it’s easy to forget seemingly obvious productionizing tasks. Are all the queries using indexes? Are we storing something in the DB as a serialized string when we shouldn’t be? Do we respect email unsubscribes? Each of these seems minor on its own, but they can add up quickly as support emails stream in post-launch.
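One cheap way to answer the indexes question is to run EXPLAIN on your hot queries and flag any plan that falls back to a sequential scan. A minimal sketch (the plan text below is a canned Postgres-style example, not output from a real query; in a Rails app the plan would come from the database, e.g. via ActiveRecord):

```ruby
# Flag query plans that fall back to a sequential scan, which usually
# means a missing index. The plan below is a canned example.
def seq_scan?(explain_output)
  explain_output.include?("Seq Scan")
end

plan = <<~PLAN
  Seq Scan on users  (cost=0.00..458.00 rows=10000 width=244)
    Filter: (email = 'a@example.com'::text)
PLAN

warn "users.email may be missing an index" if seq_scan?(plan)
```

Running a pass like this over your most frequent queries during the QA window is a lot cheaper than discovering a table scan under launch traffic.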
4. We had visibility into how our app was functioning.
New Relic was great in helping us figure out when we had to scale up our dynos. We also used StatHat to track nearly everything. Instead of trying to figure out up front which metrics we’d need, we tried to write a stats library where tracking a stat cost nearly nothing. Then we sprinkled the tracking calls everywhere we could.
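The “costs nearly nothing” part mostly means the tracking call never blocks the request: stats get pushed onto an in-process queue and a background thread does the slow network send. A minimal sketch of the idea (the actual StatHat HTTP call is swapped out for an injectable sender, and names like `Stats#count` are ours for illustration, not StatHat’s API):

```ruby
# Fire-and-forget stat tracking: callers push onto a queue and return
# immediately; a background thread does the slow network send. The
# sender is injectable so a real StatHat HTTP call could be swapped in.
class Stats
  def initialize(sender:)
    @queue  = Queue.new
    @sender = sender
    @worker = Thread.new do
      while (stat = @queue.pop) # nil signals shutdown
        @sender.call(stat) rescue nil # never let tracking break the app
      end
    end
  end

  # Cheap enough to sprinkle everywhere: just a queue push.
  def count(name, value = 1)
    @queue << { name: name, value: value }
  end

  # Flush remaining stats and stop the worker (e.g. at process exit).
  def drain
    @queue << nil
    @worker.join
  end
end

sent = []
stats = Stats.new(sender: ->(stat) { sent << stat })
stats.count("signups")
stats.count("emails.sent", 3)
stats.drain
```

Because the call site is just a hash push, there’s no reason to debate whether a stat is worth tracking — you track it and decide later whether the graph is useful.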
StatHat produces beautiful time-series charts for any stat you send it. During launch, we kept a few of these graphs up. They kept us informed on how the launch was going and alerted us to any abnormalities, especially after deploys.
5. We tried to avoid adding complexity when productionizing the app.
In previous incarnations of our stack we had a lot of moving parts, making it difficult to reason about what was going on at any one time. With Potluck, we tried to keep our code and our stack as simple as possible (Rich Hickey has a great talk on simplicity where he defines simplicity and why it’s important), paying particular attention to areas where it wasn’t immediately obvious what was going on or how to debug issues.
For example, we did a lot of Russian Doll caching with Branch. Although this made the app more responsive, it also added a lot of mental overhead when reasoning about the app’s view layer. We had to consider how each HTML fragment was cached and when its cache expired, which made it difficult to work in the codebase and to find and fix bugs. With Potluck, we only action-cached one or two of the highest-traffic endpoints. The gains were substantial, and at a much lower maintenance cost than our previous attempts.
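The core of action caching is simple: memoize the whole rendered response per endpoint, rather than nesting fragment caches throughout the views. Since we can’t show the real endpoints, here’s a plain-Ruby sketch of the idea (class name, keys, and the TTL are all illustrative; in Rails this role was played by the framework’s action caching):

```ruby
# Minimal action-cache sketch: memoize entire rendered responses per
# endpoint, with a TTL. Names and the 5-minute TTL are illustrative.
class ActionCache
  Entry = Struct.new(:body, :expires_at)

  def initialize(ttl: 300, clock: -> { Time.now })
    @ttl, @clock, @store = ttl, clock, {}
  end

  def fetch(key)
    entry = @store[key]
    return entry.body if entry && entry.expires_at > @clock.call
    body = yield # render the expensive response once
    @store[key] = Entry.new(body, @clock.call + @ttl)
    body
  end
end

cache = ActionCache.new
renders = 0
2.times { cache.fetch("GET /popular") { renders += 1; "<html>…</html>" } }
renders # => 1
```

One cache per endpoint is easy to reason about: there’s a single key and a single expiry to consider, instead of a tree of nested fragments each with its own invalidation rules.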
Lastly, and possibly most importantly, although much of the above came from experience, it was also possible because the weeks before launch weren’t a stressful, adrenaline-fueled sprint. Although launch sprints seem necessary, we found that when we pulled 70-hour weeks, we got a lot less done than we thought and almost always ended up with a lot of technical debt to clean up later. Planning and a focus on simplicity meant we could celebrate our launch night instead of furiously fixing bugs hopped up on 4-hour-energy shots.