How Defensible managed a surge of interest during the California fires with 100% uptime.

Defensible
3 min read · Aug 26, 2020


What a week! Now we know the exhilaration and anxiety of experiencing a social-media virality event. That, coupled with the actual fires and smoke surrounding the team here, made for a memorable few days.

Let’s take a trip back to the founding of this site. After the Paradise fire of November 8, 2018, we created defensibleapp.com, knowing that fire danger and the fires themselves would be in focus for Californians with increasing frequency. We researched the available datasets and built a model, in which we have a reasonable level of confidence, that correlates a general fire risk to properties and buildings. Later we added live fires and more or less put the site on autopilot.

As this fire season approached, we really didn’t think much of the site. It was a project we had built a year-plus ago; it had a nice UI but hadn’t generated much interest. That all changed with the “Fire Complexes of August 16th, 2020”: multiple fires all over the Bay Area, Santa Cruz, and Napa/Sonoma. We woke up to see this starting. From a normal load of 1–2k users a day, we jumped to 30k, then 125k the next day. All told, about a quarter million unique users.

Now, because this was never a money-making endeavor and requires general-purpose compute, the site is hosted on a single bare-metal server. It’s a fairly heavy site: we serve all those beautiful vector tiles ourselves rather than going through Mapbox or a similar service. Luckily, we had put CloudFront in front of it and cached aggressively. And since this was a localized event, most of the tile requests landed in the same general area, which gave the cache a chance to be very effective. The resulting high cache hit rate really saved the day.
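For the curious, here’s a minimal sketch of the kind of origin setup that describes: a tile endpoint serving pre-rendered vector tiles with a long Cache-Control header so CloudFront can hold them at the edge. This is not our production code; the Express framework, the tile directory, and the header values are assumptions for illustration.

```typescript
// Sketch only (not Defensible's actual server): serve pre-rendered .pbf tiles
// with a Cache-Control header that lets a CDN such as CloudFront cache them hard.
import express from "express";
import { promises as fs } from "fs";
import path from "path";

const app = express();
const TILE_DIR = "/var/tiles"; // hypothetical location of pre-rendered tiles

app.get("/tiles/:z/:x/:y.pbf", async (req, res) => {
  const { z, x, y } = req.params;
  const tilePath = path.join(TILE_DIR, z, x, `${y}.pbf`);
  try {
    const tile = await fs.readFile(tilePath);
    res.set({
      "Content-Type": "application/x-protobuf",
      // Browsers hold the tile for an hour; the CDN for a day.
      "Cache-Control": "public, max-age=3600, s-maxage=86400",
    });
    res.send(tile);
  } catch {
    res.status(404).end();
  }
});

app.listen(8080);
```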

The end result was this: 100% uptime. The HTML page itself is static, with no database hit on page load, so we were able to deliver a snappy sub-100ms load to all users. CloudFront’s free-plan TTL is around 2 hours, which meant tiles were continually being re-requested from the origin as they expired from the cache.
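If we ever move off the default settings, that TTL is something we could pin ourselves. As a sketch (not our actual infrastructure), here’s roughly how a longer tile TTL might be declared with the AWS CDK; every name below is hypothetical.

```typescript
// Hypothetical infrastructure sketch: a CloudFront distribution with a cache
// policy that holds tiles at the edge for a day instead of roughly two hours.
import * as cdk from "aws-cdk-lib";
import * as cloudfront from "aws-cdk-lib/aws-cloudfront";
import * as origins from "aws-cdk-lib/aws-cloudfront-origins";

const app = new cdk.App();
const stack = new cdk.Stack(app, "DefensibleEdgeStack");

const tileCachePolicy = new cloudfront.CachePolicy(stack, "TileCachePolicy", {
  defaultTtl: cdk.Duration.hours(24),
  minTtl: cdk.Duration.hours(1),
  maxTtl: cdk.Duration.days(7),
  enableAcceptEncodingGzip: true,
});

new cloudfront.Distribution(stack, "TileDistribution", {
  defaultBehavior: {
    origin: new origins.HttpOrigin("origin.defensibleapp.com"), // hypothetical origin host
    cachePolicy: tileCachePolicy,
    viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
  },
});
```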

It wasn’t all unicorns and rainbows, though. Logging into the server, the average load was north of 40. At the peaks, when a tile wasn’t cached, we were likely serving it either very slowly or not at all. Before the next fire we want to look into a queuing system that gives us better control over tile requests so we don’t overload the machine.
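To make that concrete, here’s roughly the shape of the queue we have in mind, sketched in TypeScript: cap how many uncached tiles are generated at once so a cache-miss storm degrades gracefully instead of pushing the load average past 40. This isn’t running today, and renderTile is a hypothetical stand-in for our actual tile generation.

```typescript
// Sketch of a concurrency-limited tile queue (not currently in production).
type TileJob = { z: number; x: number; y: number };

class TileQueue {
  private queue: {
    job: TileJob;
    resolve: (buf: Buffer) => void;
    reject: (err: unknown) => void;
  }[] = [];
  private active = 0;

  constructor(
    private maxConcurrent: number,
    private render: (job: TileJob) => Promise<Buffer>, // hypothetical renderer
  ) {}

  enqueue(job: TileJob): Promise<Buffer> {
    return new Promise((resolve, reject) => {
      this.queue.push({ job, resolve, reject });
      this.drain();
    });
  }

  private drain() {
    // Start jobs only while we are under the concurrency cap.
    while (this.active < this.maxConcurrent && this.queue.length > 0) {
      const next = this.queue.shift()!;
      this.active++;
      this.render(next.job)
        .then(next.resolve, next.reject)
        .finally(() => {
          this.active--;
          this.drain();
        });
    }
  }
}

// Usage (renderTile is hypothetical):
// const queue = new TileQueue(8, renderTile);
// const tile = await queue.enqueue({ z: 12, x: 655, y: 1582 });
```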

We didn’t dare purge the cache during this time, so as we rolled out a few small features that users requested, for example adding a hash URL so people could share their specific view, we simply pushed the changes and let them roll out gradually as the cache expired.
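For anyone curious how the hash URL works, the idea is simply to mirror the current map view into location.hash and read it back on page load. The exact format we shipped may differ; this is an illustrative sketch.

```typescript
// Sketch of a shareable view hash, e.g. https://defensibleapp.com/#12/37.77000/-122.42000
function writeViewToHash(lat: number, lng: number, zoom: number): void {
  window.location.hash = `${zoom}/${lat.toFixed(5)}/${lng.toFixed(5)}`;
}

function readViewFromHash(): { lat: number; lng: number; zoom: number } | null {
  const parts = window.location.hash.replace(/^#/, "").split("/");
  if (parts.length !== 3) return null;
  const [zoom, lat, lng] = parts.map(Number);
  return Number.isFinite(zoom) && Number.isFinite(lat) && Number.isFinite(lng)
    ? { lat, lng, zoom }
    : null;
}
```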
