How Streak Tracks Down App Engine Datastore Bugs

NoSQL is great because it’s fast and scalable. One thing this scalability comes with is less safety with database references across entities. Let’s say we have a bunch of Dogs being walked by a DogWalker.

We’d like the data to be consistent such that a dog points to its dogwalker and the dogwalker points to all of the dogs it walks:

But maybe we actually see something more like the following, with an inconsistent pointer from Dog 2 to something outside the shown group:

A typical scalable database engine like App Engine’s Datastore will only look at each entity in isolation. Since the entities are technically fine in isolation, there’s no guarantee of consistency from the engine level.

The long-term solution is to have a rock-solid, atomic data access layer. Ours is already quite battle-tested at this point, but we’d still notice some rare problems. On examination we’d track down some symptom to a situation like the above, where for some reason a dog was roaming around loose. We could see the bad data. Unfortunately, we had no idea how the data got into that state.

We invented what we think is an interesting approach to tracking down problems across entities. We started keeping audit logs, logging every change to every entity. That may take up a lot of space depending on how much your data changes, but data storage is relatively cheap.

Let’s say we kept an audit log for each of the dogs listed above, they could look like:

We then spliced together the audit logs to conceptually be more like audit logs for the group of entities.

We made a further improvement to it: we added support for an indicator function that looked at the dogs and the dog walker, and noted whether a problem was present. We’ll mark lines where a problem develops or vanishes.

Now we look at the stack traces for those rows where a problem developed (and ones within 1 second if you suspect race conditions), and we pretty much inevitably find the problem. Much faster than reading every log for a set of objects by hand.

Our finished process is an automated variant of something like the following:

  1. Track down a symptom to bad data.
  2. Run an App Engine mapreduce search to find all groups of data with that problem, involving data objects created only in the last week. (This guarantees that the problem also developed in the last week and will be in our audit logs.)
  3. Picking a set of entities with the given problem, get the stored audit logs for all involved entities.
  4. Using the above process, find the web request or stack trace which caused the problem.
  5. (Optional) Repeat steps 3–4, to catch problems with multiple causes.

Having the audit logs in advance is what makes this approach work. We don’t have to go and collect data after finding the problems, we can be quick and responsive with access to historical data.

If you find our approach useful, please let us know in the discussion on HN.