Enki Database Timeouts: A Postmortem

LizTheDeveloper
Published in Enki Blog
Jan 16, 2018 · 3 min read

If you’re a regular Enki user, you probably noticed the timeouts we’ve been having recently.

Our team has been all-hands-on-deck trying to sort out this issue. Here’s the timeline of events:

Dec 20th: We started to notice more timeout errors than usual. We knew the database would sometimes slow down during long-running jobs, but consistent timeouts were growing by the day, and many users had to try to log in 3 or 4 times.

We looked at our database and discovered that one collection was growing too fast: it was far larger than the rest of the database combined! To top it off, this collection was “hot”: frequently accessed and frequently scanned by many jobs and endpoints.

We took steps to begin restructuring this collection and archiving historical data.
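
Roughly, the archiving idea looks like the sketch below. This is an illustration, not our actual code: it assumes a MongoDB-style document database (which “collections” implies), and the collection names, field names, and 90-day cutoff are made up for the example.

```typescript
// Rough sketch: move historical documents out of a hot collection into an
// archive collection. "activity", "createdAt", and the 90-day cutoff are
// illustrative names and values, not our real schema.
import { MongoClient } from "mongodb";

async function archiveOldDocuments(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    const db = client.db("app");
    const hot = db.collection("activity");              // fast-growing, frequently scanned
    const archive = db.collection("activity_archive");  // cold storage, rarely queried

    const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000);

    // Copy each historical document into the archive, then remove it from the hot collection.
    const cursor = hot.find({ createdAt: { $lt: cutoff } });
    for await (const doc of cursor) {
      await archive.insertOne(doc);
      await hot.deleteOne({ _id: doc._id });
    }
  } finally {
    await client.close();
  }
}
```

Keeping the hot collection small means the frequent scans touch far less data.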

Dec 31st: Our database went offline to deal with issues caused by Spectre and Meltdown. When it came back online, one of our nodes no longer worked and the timeout problem was worse.

Jan 1: We noticed a problem with the way our web server was connecting to our database and added more robust fault tolerance to those connections. Even so, the timeout problem had grown much worse.
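
To give a flavour of what we mean by fault tolerance, here is a simplified retry-with-backoff wrapper around a database call. It is only a sketch: the attempt count, delays, and the usage example are illustrative, not our production values.

```typescript
// Illustrative retry-with-exponential-backoff wrapper for a flaky database call.
// Attempt count and delays are example values only.
async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Back off before retrying: 100ms, 200ms, 400ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Usage (hypothetical): await withRetry(() => usersCollection.findOne({ email }));
```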

Jan 2: We decided to return 503s whenever open queries had already been waiting long enough that we could predict a new query would time out. This temporarily increased the number of 503s we returned, but it meant you could retry your action right away instead of waiting 10 seconds for a timeout, which made for a slightly better user experience. It alleviated enough of the problem to make the app “usable”, but not acceptably so.
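
Here is a simplified sketch of that fail-fast idea, written as an Express-style middleware. The framework choice, the 5-second threshold, and the queue tracking are assumptions for illustration, not our actual code.

```typescript
// Simplified load-shedding sketch: if the oldest open request has already been
// waiting long enough that a new query would likely time out, reply 503 right
// away so the client can retry quickly. The threshold is illustrative.
import express from "express";

const app = express();

const pendingSince: number[] = [];  // start times of requests currently in flight
const SHED_AFTER_MS = 5000;         // if the oldest has waited this long, shed new work

app.use((req, res, next) => {
  if (pendingSince.length > 0 && Date.now() - pendingSince[0] > SHED_AFTER_MS) {
    res.status(503).set("Retry-After", "1").send("Temporarily overloaded, please retry");
    return;
  }
  const started = Date.now();
  pendingSince.push(started);
  res.once("finish", () => {
    const idx = pendingSince.indexOf(started);
    if (idx !== -1) pendingSince.splice(idx, 1);
  });
  next();
});
```

Failing fast like this trades a burst of 503s for much shorter waits, which is why retries felt quicker even while the database itself was still struggling.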

Jan 12th: After finding some issues with our current cloud provider, we migrated our database to another one, changed its configuration, and changed how we connect to it. We also tuned several queries and jobs, dug through logs, profiled, and updated packages. Our queries got dramatically faster and we needed to serve fewer 503s, but the number of 503s was still quite high even though the database could now handle the load. We began to investigate multiple causes.
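
As one small example of the profiling side of this work, here is how slow-query profiling can be switched on in a MongoDB-style database. This is not our exact setup, and the 100 ms threshold is just an example.

```typescript
// Illustrative: enable MongoDB's database profiler for operations slower than
// 100 ms, so slow queries show up in the system.profile collection for analysis.
import { MongoClient } from "mongodb";

async function enableSlowQueryProfiling(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    // Profile level 1 records only operations slower than `slowms`.
    await client.db("app").command({ profile: 1, slowms: 100 });
  } finally {
    await client.close();
  }
}
```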

Jan 15th: Over the weekend we looked back at the fail-fast code we had written and determined it was now too aggressive: with the root cause alleviated, the stopgap no longer needed to be so strict. We tweaked its parameters, and 503s dropped to 0 immediately. 😅

We’re still working on a few of the scaling issues to get further ahead of this problem, but most of the infrastructure changes we needed to make have been put into place and service has been restored to its original working order.

Sorry about the issues! We’re a small group of passionate educators still in the early stages of product development, and we give away most of our product for free, so we have few resources. Everyone wears a lot of hats, and our whole team has been collaborating on nothing but this issue for the last 15 days. A lot of sleepless nights and constant hustle got our scaling issues addressed, and it was a big team effort.

I want to say thank you to all of the Enki devs for jumping in and passing the baton effectively to the rest of the team, so that we were working on this 24 hours a day but could still sleep. I also want to say thanks to our users who put up with these issues over the last couple of weeks. I know we’re all developers (or will be, some day) and we all know what it’s like to deal with issues like this, and it means a lot to us that you come back to learn something new every day, even when we’re struggling.

So long as you keep learning, we will too.
