How we prevented missing SQL indexes

Simon Labute
Samsara R&D
Published Apr 26, 2022 · 4 min read

At Samsara, we fixed a recurring problem: deploying new code that inadvertently issued unindexed SQL queries, causing full table scans and occasionally taking down a database.

Most SQL databases operating at scale eventually hit performance issues from missing SQL indexes. Some databases have systems that recommend indexes; otherwise, engineers hunt down slow queries and run SQL EXPLAIN on them to identify potential improvements.

Instead of continuing to monitor for these issues in staging and production, we added a check to our continuous integration (CI) pipeline that detects SQL queries that won’t hit any indexes. We haven’t had a single index-related outage since. This post shares the pattern we used, and how it expanded our thinking on what automated testing can and should be used for.

Bugs that survive CI are more expensive

Some types of bugs are incredibly hard to detect statically (based on the text of a program’s code alone) but quite easy to catch at runtime. So, we often relegate finding these issues to staging or production environments, instead of first asking whether we can leverage our existing test suites to detect them.

As a code change progresses from development to production, the stakes go up for undiscovered issues. Finding a bug in production is apt to be much more costly than finding one in your dev environment.

The further an issue makes it in the release process, the more it costs.

The further along the process you are, the more time engineers need to spend diagnosing the root cause and context switching. Most importantly, a bug in production may impact the quality of the product you’re putting in front of customers.

For issues that we can’t catch in local development, a well-chosen piece of testing instrumentation can sometimes find potential issues before they get further along. This matches up with Samsara’s culture of shifting left across our engineering processes where possible.

Detecting missing SQL indexes

A missing SQL index is one of those cases that’s easy to detect at runtime but hard to detect statically. We can fairly easily tell whether an index was missed when actually running a fully built query against the database. But it’s incredibly hard to know — whether by engineers inspecting manually or via a static linter — if an index will be hit when a query’s clauses are built up across many functions, each adding its own conditions.

The pattern we adopted to solve this is pretty simple:

  1. In test environments, run a SQL EXPLAIN before every query.
  2. If a query unexpectedly doesn’t hit an index, fail the test.
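The core of this check is parsing the EXPLAIN output and deciding whether the plan touches an index. Here is a minimal Go sketch of that decision, assuming MySQL-style EXPLAIN output; the `ExplainRow` struct and `usesIndex` function are hypothetical names, and a real implementation would scan these columns from the rows returned by running `EXPLAIN <query>` against the test database:

```go
package main

import "fmt"

// ExplainRow models the subset of MySQL EXPLAIN columns the check needs
// (hypothetical struct; populated by scanning EXPLAIN's result rows).
type ExplainRow struct {
	AccessType string // MySQL's "type" column: "ALL" means full table scan
	Key        string // the index actually chosen; empty means none was used
}

// usesIndex reports whether every step of the query plan hits an index.
func usesIndex(plan []ExplainRow) bool {
	for _, row := range plan {
		if row.AccessType == "ALL" || row.Key == "" {
			return false
		}
	}
	return true
}

func main() {
	indexed := []ExplainRow{{AccessType: "ref", Key: "idx_user_id"}}
	fullScan := []ExplainRow{{AccessType: "ALL", Key: ""}}
	fmt.Println(usesIndex(indexed))  // an indexed lookup passes
	fmt.Println(usesIndex(fullScan)) // a full scan fails the test
}
```

In test middleware, a `false` result here would fail the running test, pointing the engineer at the offending query before the change ships.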

This approach is only as good as the test coverage of the codepaths in question. If a combination of function parameters results in a query that misses indexes, we’d only detect it if the specific case is covered by tests. We’re fortunate at Samsara to have a good culture around writing tests. This means that as soon as we turned this check on, we found a laundry list of unindexed queries that we started to tackle in priority order.

We’ve added this capability to Thunder, our open-source GraphQL framework. We allow for exceptions to the index check, but an engineer has to explicitly mark a query as such (by calling FullTableQuery instead of Query). This means that we don’t get inadvertent full table scans anymore, and instead, engineers must proactively identify queries that are expected to be full scans, and code reviewers can challenge this if appropriate.
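The escape hatch works because the opt-out is a different function at the call site, so it shows up in code review. The sketch below illustrates that shape in Go; the `DB` type and `explainHitsIndex` stub are hypothetical (the method names follow the post’s `Query`/`FullTableQuery`, but the internals here are illustrative, not Thunder’s actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// DB is a hypothetical wrapper around a database handle.
type DB struct {
	testMode bool // true when running under the test suite
}

// Query runs a statement and, in tests, fails if the plan misses an index.
func (db *DB) Query(sql string) error {
	if db.testMode && !explainHitsIndex(sql) {
		return fmt.Errorf("query does not hit an index: %s", sql)
	}
	return nil // execute against the real database here
}

// FullTableQuery explicitly opts out of the index check, making the
// intended full scan visible to code reviewers at the call site.
func (db *DB) FullTableQuery(sql string) error {
	return nil // execute without the EXPLAIN guard
}

// explainHitsIndex stands in for issuing EXPLAIN and parsing the plan;
// this toy heuristic exists only to make the sketch self-contained.
func explainHitsIndex(sql string) bool {
	return strings.Contains(sql, "WHERE id =")
}

func main() {
	db := &DB{testMode: true}
	fmt.Println(db.Query("SELECT * FROM users WHERE id = 1")) // passes the check
	fmt.Println(db.Query("SELECT * FROM users"))              // fails: no index
	fmt.Println(db.FullTableQuery("SELECT * FROM users"))     // explicit opt-out
}
```

Because the opt-out is a distinct, greppable call, the team can also audit how many intentional full scans exist over time.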

Bonus: Detecting GraphQL query invalidations

We haven’t had a missing-index outage since introducing the SQL check, so we reused the pattern: additional test middleware for Thunder GraphQL query invalidations, helping us detect poor performance there as well.

We similarly failed tests whenever a full table subscription occurred (in Thunder, a full table subscription invalidates drastically more queries than one with finer resolution). This meant we could tackle all our worst offenders for query performance and prevent future inadvertent performance degradations.

[Figure: Query rate over the 2-week rollout of these performance improvements]

Our query rate at peak hours of the day dropped about 25%!

Takeaway

Automated testing is mostly thought of as targeting application behavior, asserting that deterministic inputs yield expected outputs. We often relegate runtime checks to further along the release process (often production, sometimes staging).

Instead, here’s a pattern to keep in your arsenal: lean on your existing test coverage as another runtime option. If something is easy to detect at runtime, try to instrument your tests to detect it before relying solely on production observability.

If these types of problems sound exciting to you, we’re hiring for many engineering roles at Samsara and would love to hear from you. You can see our open positions here.

I’d like to give a shoutout to Emma Ferguson & John Lee for their contributions to this infrastructure & post.
