Migrating Versioned Entities in Google App Engine

Automatic migration for ndb.Model classes

Julian Diaz
Brain Hacking at workZeit

--

Code for this article: https://gist.github.com/jdiaz5513/8912276

One of the major advantages of using a NoSQL solution like Google App Engine’s Datastore is that no schema is enforced on your data. One of your User entities can have just a full_name field, while another can have first_name and last_name fields instead. Heck, you can even be really silly and give some users an integer as a last_name.
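To make that concrete, here’s a tiny sketch using ndb.Expando (the property names are made up for illustration): two entities of the same kind with completely different properties, and the datastore happily stores both.

from google.appengine.ext import ndb

class User(ndb.Expando):
    """An Expando accepts arbitrary properties on each entity."""
    pass

# Two User entities of the same kind, with totally different schemas.
u1 = User(full_name='Ada Lovelace')
u2 = User(first_name='Alan', last_name=42)  # yes, an integer last_name
ndb.put_multi([u1, u2])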

While this is a fantastic feature, it can also become a major headache, especially when modeling data with ndb.Model classes. As an application evolves over time, it’s not unheard of to want to remove unused properties, add new ones, or even rename existing ones.

With ndb.Model, adding properties is easy. The real problem starts to rear its head when you need to delete or rename existing properties.

The official recommendation is to temporarily change the model to subclass ndb.Expando instead of ndb.Model, then iterate through every entity and delete the properties as an Expando (a sketch of this follows the list below). There are two main problems with this approach:

  • Migrating all entities at once might be overkill. With huge datasets this is a real concern. It might make sense to only migrate on demand as needed, especially if you’re not sure whether or not a specific entity will ever be accessed again.
  • New code may be unusable until the migration is done. All sorts of problems arise here. If your new code depends on the existence of certain properties, or on their being properly initialized/copied over from old properties, you might have to shut down during a maintenance window to give yourself time to migrate all the entities before deploying a new app version. Of course you could (should?) make sure your new code can handle these scenarios, but that can get messy...
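For reference, that officially recommended approach looks something like this (a minimal sketch; old_property is a placeholder name):

from google.appengine.ext import ndb

class User(ndb.Expando):  # temporarily an Expando instead of ndb.Model
    pass

# Iterate through every entity and strip the dead property.
for user in User.query():
    if 'old_property' in user._properties:
        del user.old_property
        user.put()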

You might be thinking you can simply ignore the old properties and keep making new ones. There’s a problem, though: even if you remove a property from the ndb.Model class, it persists in the datastore! This can get expensive if you have a lot of dead properties lying around.

Here I’ll introduce you to some undocumented tricks, wrapped up in a special class, that make deleting and renaming properties much easier.

Use versioned entities, and check the version in a _post_get_hook

The first step is to add a version property to the model class.

from google.appengine.ext import ndb

class VersionedModel(ndb.Model):
    new_property = ndb.StringProperty()
    version = ndb.IntegerProperty(name='v', default=1)

This lets us keep track of each entity’s version so we can tell whether it needs to be migrated. Next, we’ll add a migration function:

    def _migrate_old_version(self):
        to_put = []
        to_delete = []
        # Migration logic goes here. Values in _values are wrapped in
        # _BaseValue, so unwrap them with .b_val before copying.
        if 'old_property' in self._values:
            self.new_property = self._values['old_property'].b_val
            # Remove the old property from both internal dicts so it
            # disappears from the datastore on the next put.
            del self._values['old_property']
            del self._properties['old_property']
        self.version = 1
        to_put.append(self)
        return (to_put, to_delete)

In this case, we’ve got an old property (aptly named old_property) that we want to rename to new_property. This is where we head into undocumented territory, so bear with me...

NDB stores all property definitions and their values in two special dicts: _values and _properties. These will contain all properties fetched from the datastore, regardless of whether or not they’re defined in the model class, and this is the data that gets sent back to the datastore during put calls. Deleting the property from both of these dicts will effectively remove it once you call put on the entity.

Grabbing a value from the _values dict is the tricky part. These values are always stored in a _BaseValue wrapper, so we’ll need to use the b_val property to get at the real data. The value we get back from b_val will differ depending on what kind of property it originally was; ndb.JsonProperty, for example, actually returns a str in JSON format that you’ll need to decode yourself. Some experimentation in an interactive console should make it clear what you’ll need to do for your properties. In a lot of cases (like this one) you’ll just be able to use b_val directly.
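As an example, here’s roughly what that unwrapping looks like inside _migrate_old_version for a property that used to be an ndb.JsonProperty (a sketch; old_json_property and new_json_property are made-up names):

import json

# b_val on a value stored by ndb.JsonProperty is a JSON-encoded str,
# so decode it before assigning it to the new property.
if 'old_json_property' in self._values:
    self.new_json_property = json.loads(
        self._values['old_json_property'].b_val)
    del self._values['old_json_property']
    del self._properties['old_json_property']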

Because old_property is not defined in the new version of the model class, the only way to access it is through the _values dict. Subsequently using del to remove it from _values and _properties will make sure it’s gone for good. You can use this migration function to do all sorts of mutations and cleanup actions on your models.

I’ll get to the to_put and to_delete strangeness in a bit. For now, let’s add a _post_get_hook:

    @classmethod
    def _post_get_hook(cls, key, future):
        entity = future.get_result()
        if entity is not None and entity.version < 1:
            entity._migrate_old_version()

Here we hook into the entity after every get call by calling get_result on the future passed into the hook. There are cases where the future will return None (for instance, when no entity exists for that key), so watch out for that.

Once we have the entity, it’s a simple check on the version number and then we call the migration function if needed. Anything that makes use of the entity after this point will have a newly migrated entity that’s up to date and ready to go!
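In practice the flow looks like this (a minimal sketch; the ID and value are made up):

# The _post_get_hook fires during get_by_id, so the entity is already
# migrated by the time we see it. A single put then persists both the
# migration and our own changes in one write.
entity = VersionedModel.get_by_id(1234)
entity.new_property = 'some updated value'
entity.put()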

You’ll notice that put isn’t being called inside the hook itself. That’s important for performance: if you plan on making changes to the entity and calling put later, you’ll want to make sure you’re only calling put once. If this becomes a concern because the migration function itself is expensive, or if you need the new properties to be available for indexed queries, then there’s another fun thing you can do:

Use MapReduce to migrate versioned entities

I won’t get into detail here about how to use the MapReduce library (that’s a whole other article!), but I’ll throw up a quick code snippet that works extremely well for me in production:

def _VersionedModel_migration(entity):
    from mapreduce.operation import ndb as op
    # Run the same migration logic, then let the MapReduce framework
    # batch up the resulting puts and deletes.
    to_put, to_delete = entity._migrate_old_version()
    for e in to_put:
        yield op.Put(e)
    for e in to_delete:
        yield op.Delete(e)

Use that function as the handler for a mapper operation (no need for a reduce step) and it’ll migrate all your entities in short order. This is where the to_put and to_delete lists come in handy; you can clean up unused entities as part of the migration process.
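Kicking off the job can look something like this (a hedged sketch: the job name, module paths, and shard count are assumptions, and the right input reader depends on your version of the mapreduce library):

from mapreduce import control

# Start a mapper-only job over every VersionedModel entity.
# handler_spec and entity_kind are dotted paths into your own code.
control.start_map(
    name='VersionedModel migration',
    handler_spec='migrations._VersionedModel_migration',
    reader_spec='mapreduce.input_readers.DatastoreInputReader',
    mapper_parameters={'entity_kind': 'models.VersionedModel'},
    shard_count=8)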

Note: mapreduce.operation.ndb does not exist in the mapreduce library supplied by Google; I added that myself to introduce compatibility with NDB. Gist: https://gist.github.com/jdiaz5513/8911930

Careful with queries!

Special care needs to be taken with queries when using this trick. As of SDK 1.8.9, entities returned from a query do not trigger the _post_get_hook. Fortunately, there are ways to deal with this.

One thing that may help tremendously, depending on the specific use case, is to use keys_only queries exclusively. These queries are fast, and an ndb.get_multi call on those keys will return properly-migrated entities. There’s an important side effect to using this method: entities fetched this way benefit from the ndb cache. How much that helps depends on how often you write, but it can be a cost and time saver (read more: https://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118). It could also hurt performance, so do be careful.
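Here’s the pattern in a nutshell (a minimal sketch):

# Fetch keys only, then load the entities through the normal get path,
# so _post_get_hook runs and migrates each one.
keys = VersionedModel.query().fetch(500, keys_only=True)
entities = ndb.get_multi(keys)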

Another strategy, crazy as it sounds, is to avoid queries whenever possible. This is good practice in general: if there’s a way to fetch an entity by key or ID, you should always prefer that over a query. For example, looking up a user by email address shouldn’t be done by querying on the email address; it’s actually better to use two get calls, one on an index entity whose ID is the email address, and then one on the user key stored in that index entity. Because of the way the High Replication Datastore works, queries across entity groups are not guaranteed to be consistent, so they’re only really useful for searching through the dataset; you cannot count on data being up-to-the-minute unless you’re fetching it directly by key.
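A sketch of that index-entity pattern (EmailIndex and get_user_by_email are made-up names for illustration):

from google.appengine.ext import ndb

class EmailIndex(ndb.Model):
    """Key ID is the email address; points at the real user entity."""
    user_key = ndb.KeyProperty()

def get_user_by_email(email):
    # Two gets by key instead of one eventually-consistent query.
    index = ndb.Key(EmailIndex, email).get()
    if index is None:
        return None
    return index.user_key.get()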

--

Julian Diaz
Brain Hacking at workZeit

Partner, lead developer, and mind-hacker extraordinaire for workZeit.