Mike Mason: "Good stats and marked improvement between 2014 and 2015."

  • The main reason is that with time we’ve zeroed in on how to build our systems in a way that limits the impact of hardware or service failures on what the customer sees.
  • We decided two or three years ago that we would no longer do scheduled maintenance, and that unless absolutely necessary we would conduct all maintenance in a way that did everything possible not to take the site offline. We’ve created many tools and processes for doing this, like https://github.com/basecamp/intermission and https://github.com/basecamp/mysql_role_swap. (There’s a sketch of the role-swap idea after this list.)
  • Over time we have developed an incredible monitoring and alerting system that is backed by millions of metrics. We have dashboards for every key system and component, and our logging is robust enough that we can usually find the source of a problem even if we didn’t catch it before an incident. (A toy example of the alerting side follows this list.)
  • We used to store all of the files (for each application) on an NFS mount backed by Isilon storage. Our storage cluster was plagued by performance and reliability issues, and when it had problems it would literally take all of our sites offline. (We’ve since moved almost all of the applications to a new storage product and are free of these issues.)
  • We’ve built out our network edge and carefully picked the Internet/IP providers that work best for our business. Over time we’ve leveraged our network to limit abuse of the products. (For example, automatically detecting and blocking bad API clients; a sketch of that idea follows this list.)
  • Our internal documentation and processes have gone from an afterthought to a “must have”. We document our work, we ask for peer review, and we rehearse large procedures.
  • Where possible, we’ve automated every process we can. For example, we can reroute Internet traffic or look up the top 50 application consumers via a chat bot. Eliminating “fat finger” mistakes goes a long way. (A bare-bones chat-bot sketch follows this list.)
  • I’ve been meaning to write a separate post on this, but there’s something to be said for quietly beating the drum until you have enough time to learn to play jazz. What I mean is that we’ve been ruthless about getting rid of systems and processes that were inadequate for the level we wanted to operate our sites at, and it didn’t happen overnight. We had to do everything we were doing (maintain the level of operations), support new products, and then build and revise (for the longer-term improvements). With each improvement we gave ourselves more headroom to spend more time on the remaining issues. Fewer fires mean less firefighting.
  • One example of this leverage is that we actually run the latest version of Basecamp live from two data centers (Chicago, IL and Ashburn, VA). We regularly move traffic between data centers, and our customers are none the wiser. (A sketch of the weighted traffic-shifting idea is below.)
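
The general idea behind a tool like mysql_role_swap (freeze writes on the primary, wait for the replica to catch up, then promote it and repoint traffic) can be sketched in a few lines. This is a minimal Python illustration, not the actual tool: the Database class and the vip_move hook are hypothetical stand-ins for real MySQL and floating-IP plumbing.

```python
import time


class Database:
    """Hypothetical handle to one MySQL server (details elided)."""

    def __init__(self, name):
        self.name = name
        self.read_only = False

    def set_read_only(self, value):
        # Real version: SET GLOBAL read_only = ON/OFF on the server.
        self.read_only = value

    def replication_lag_seconds(self):
        # Real version: read the replica's lag from replication status.
        return 0.0


def swap_roles(primary, replica, vip_move):
    """Swap primary/replica roles with only a brief write pause."""
    primary.set_read_only(True)                  # 1. stop new writes
    while replica.replication_lag_seconds() > 0:
        time.sleep(0.1)                          # 2. let the replica catch up
    replica.set_read_only(False)                 # 3. promote the replica
    vip_move(replica)                            #    and repoint traffic at it
    print(f"{replica.name} is now primary; {primary.name} is now a replica")


if __name__ == "__main__":
    swap_roles(Database("db-1"), Database("db-2"),
               vip_move=lambda db: print(f"floating IP -> {db.name}"))
```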
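
For a flavor of what sits on top of a metrics pipeline like the one described above, here is a toy threshold-alerting sketch in Python. The metric names, thresholds, and the page_oncall hook are placeholders for the example, not the real setup.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    metric: str        # e.g. "app.response_time_p99_ms" (illustrative name)
    threshold: float   # alert when the latest value exceeds this
    message: str


def evaluate(rules: List[Rule],
             latest: Dict[str, float],
             page_oncall: Callable[[str], None]) -> None:
    """Check the latest value of each metric against its rule."""
    for rule in rules:
        value = latest.get(rule.metric)
        if value is not None and value > rule.threshold:
            page_oncall(f"{rule.message}: {rule.metric}={value}")


if __name__ == "__main__":
    rules = [
        Rule("app.response_time_p99_ms", 500, "Slow responses"),
        Rule("mysql.replication_lag_s", 5, "Replication falling behind"),
    ]
    latest = {"app.response_time_p99_ms": 730, "mysql.replication_lag_s": 1}
    evaluate(rules, latest, page_oncall=print)   # prints one alert
```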
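
The "automatically detecting and blocking bad API clients" idea boils down to counting recent requests per client and pushing a block rule when a client crosses a limit. The Python sketch below shows only that shape; the window size, limit, and block() hook are invented for the example and are not how the real edge works.

```python
import time
from collections import defaultdict, deque
from typing import Optional

WINDOW_SECONDS = 60    # look at the last minute of traffic (illustrative)
MAX_REQUESTS = 600     # more than this per window looks abusive (illustrative)

_recent = defaultdict(deque)   # client token -> timestamps of recent requests
_blocked = set()


def block(client: str) -> None:
    """Placeholder for pushing a block rule out to the edge (firewall, proxy)."""
    _blocked.add(client)
    print(f"blocking {client}")


def record_request(client: str, now: Optional[float] = None) -> bool:
    """Record one request; return False if the client is (now) blocked."""
    if client in _blocked:
        return False
    now = time.time() if now is None else now
    window = _recent[client]
    window.append(now)
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()                 # drop requests outside the window
    if len(window) > MAX_REQUESTS:
        block(client)
        return False
    return True


if __name__ == "__main__":
    allowed = True
    for i in range(MAX_REQUESTS + 2):    # simulate a burst from one client
        allowed = record_request("client-a", now=i * 0.01)
    print("last request allowed?", allowed)   # False: the client got blocked
```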
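
A bare-bones version of the chat-bot automation is a dispatcher that maps a small set of reviewed command names to handlers, so nobody types raw commands against production. The handlers below are stubs and the command names just mirror the two examples from the list; the real bot and its integrations are not shown.

```python
from typing import Callable, Dict

COMMANDS: Dict[str, Callable[[], str]] = {}


def command(name: str):
    """Register a handler under a fixed, reviewed command name."""
    def register(fn: Callable[[], str]) -> Callable[[], str]:
        COMMANDS[name] = fn
        return fn
    return register


@command("reroute traffic")
def reroute_traffic() -> str:
    # Stub: a real handler would drive the traffic-management tooling.
    return "Traffic rerouted to the secondary data center."


@command("top consumers")
def top_consumers() -> str:
    # Stub: a real handler would query the metrics store for the top 50.
    return "Top 50 application consumers: ..."


def handle(message: str) -> str:
    """Run a known command or refuse; there is no free-form shell access."""
    handler = COMMANDS.get(message.strip().lower())
    return handler() if handler else "Unknown command."


if __name__ == "__main__":
    print(handle("reroute traffic"))
    print(handle("rm -rf /"))            # refused: not a registered command
```

Keeping the command set fixed and peer-reviewed is exactly what takes the "fat finger" risk out of routine operations.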
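
Moving traffic between two live data centers usually comes down to adjusting weights at a routing layer (DNS, BGP, or a global load balancer). The toy sketch below only illustrates the weighted-choice idea; it assumes nothing about how Basecamp actually shifts traffic.

```python
import random

weights = {"chicago": 100, "ashburn": 0}   # percent of traffic per site


def pick_datacenter() -> str:
    """Choose a data center in proportion to the current weights."""
    return random.choices(list(weights), weights=list(weights.values()))[0]


def shift(to: str, percent: int) -> None:
    """Move up to `percent` points of traffic toward one data center."""
    other = next(dc for dc in weights if dc != to)
    moved = min(percent, weights[other])
    weights[other] -= moved
    weights[to] += moved


if __name__ == "__main__":
    shift("ashburn", 50)                       # start draining Chicago
    print(weights)                             # {'chicago': 50, 'ashburn': 50}
    print([pick_datacenter() for _ in range(10)])
```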