How we made Django migrations work on Google’s schemaless Datastore
Here at Potato we’re the benevolent dictators of Djangae, an open-source project which allows you to run the wonderful Django web framework on Google’s App Engine platform. This lets us build web applications using Django, while never having to worry about the sites receiving large amounts of traffic or accumulating colossal amounts of data, as App Engine and its schemaless database (the Datastore) work together to eat it all up.
Django contains a database migration tool for managing schema and data changes, such as adding a new column to a table, and allowing you to perform (and track) these changes on your local, staging, and production databases.
App Engine’s Datastore is a schemaless database, which means that running schema migrations is a slightly bizarre idea, and in a lot of cases is unnecessary. But sometimes it’s necessary to populate new columns with default values, or to move data around, and so for this reason we have built support for migrations on the Datastore into Djangae.
However, doing so wasn’t entirely straightforward…
Migrations Take Time
The first difficulty is that Django makes the assumption that each migration is either “applied” or “not applied”; it doesn’t have a concept of “in progress”. This is actually a problem for normal SQL databases as well.
If you’ve ever written a Django migration to perform several separate operations on a table which contains a few million rows, and then tried to run that migration while a website is simultaneously using that table to serve 25 requests a second, many of which are causing data changes, then you may have experienced the following situation:
Your connection to the database gets dropped part way through the migration, but the first operation (i.e. the first SQL statement) still completes anyway. You then have to run the migration again in order to perform the second operation. But Django hasn’t marked the migration as applied, and has no concept of it being in progress or partially applied, so it starts from the beginning and tries to perform the first operation again. That causes an error, and forces you to run manual SQL statements in order to get yourself out of a pickle.
On a SQL database you can largely mitigate this problem by simply ensuring that each migration only contains a single operation. That way, at worst, all you have to do is manually mark a migration as applied to get yourself out of the pickle. But it doesn’t spare you the difficulties of trying to run schema changes or large UPDATE commands while your website traffic is simultaneously causing data changes.
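To make that failure mode concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever SQL database you run in production. The table, the column names and the run_migration helper are all hypothetical, and the dropped connection is simulated by raising an exception between the two statements.

```python
import sqlite3


def run_migration(conn, statements, fail_after=None):
    """Run each statement in order, optionally simulating a dropped connection.

    Returns True only if every statement ran, mirroring the way a migration
    is only recorded as applied once all of its operations have succeeded.
    """
    for index, sql in enumerate(statements):
        if fail_after is not None and index == fail_after:
            raise ConnectionError("connection dropped mid-migration")
        conn.execute(sql)
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")

# A hypothetical migration containing two separate operations.
migration = [
    "ALTER TABLE article ADD COLUMN views INTEGER DEFAULT 0",
    "ALTER TABLE article ADD COLUMN slug TEXT DEFAULT ''",
]

applied = False
try:
    # The first statement completes, then the "connection" drops.
    applied = run_migration(conn, migration, fail_after=1)
except ConnectionError:
    pass  # nothing was recorded as applied

# Running the whole migration again repeats the first operation and fails.
rerun_error = None
try:
    applied = run_migration(conn, migration)
except sqlite3.OperationalError as exc:
    rerun_error = str(exc)

# Column names currently on the table: the first operation stuck, the second never ran.
columns = [row[1] for row in conn.execute("PRAGMA table_info(article)")]
```

The table is left with the first new column but not the second, and the migration is still not marked as applied, which is exactly the pickle described above.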
On the App Engine Datastore there’s no such thing as a schema, and no such thing as an UPDATE command. So to add or update a column you have to run a task to map over all of the rows and update each one.
When it comes to large amounts of data and large amounts of traffic, this is actually a good thing; the fact that you’re not trying to update all the rows at once means that you can run that task while continuing to serve your web traffic from the same table. When the task is only half complete, some rows in the table will have more columns than others, but the Datastore doesn’t care. This ability to change the wheels without taking the train off the tracks is a great asset.
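As a rough illustration of that half-finished state, the sketch below maps over a list of plain dicts in small batches. The dicts stand in for Datastore entities, and the function name, property names and batch size are all made up for the example; none of this is Djangae's actual task code.

```python
def add_property_in_batches(entities, name, default, batch_size=2):
    """Add a property to every entity, one batch at a time.

    Yields the number of entities processed so far after each batch, so the
    caller can look at the data while the job is only partly complete.
    """
    for start in range(0, len(entities), batch_size):
        for entity in entities[start:start + batch_size]:
            # setdefault leaves alone any entity that has already been
            # saved with the new property by live traffic.
            entity.setdefault(name, default)
        yield min(start + batch_size, len(entities))


entities = [{"id": i, "title": "post %d" % i} for i in range(5)]
task = add_property_in_batches(entities, "views", 0)

processed = next(task)  # run only the first batch

# Part way through, some "rows" have the new "column" and some don't.
with_views = [e["id"] for e in entities if "views" in e]
without_views = [e["id"] for e in entities if "views" not in e]

# Meanwhile, web traffic can keep writing to the same entities.
entities[4]["views"] = 7

for processed in task:  # let the remaining batches run
    pass
```

Note that the entity written by the simulated web traffic keeps its value of 7 when the task eventually reaches it.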
But bringing Django’s concept of a migration being simply either “applied” or “not applied” together with the Datastore’s concept of “you’ll need to run a task to make that change and then you’ll have to keep on checking to see when it’s finished” is a little tricky. We required a bit of crafty, cunning creativity to coerce those contradictory concepts into cohesion. First though, we have to address another point, which is that you might not want a migration at all.
Bro, Do You Even Migrate?
Because the Datastore is schemaless, you can add a new field to a model and simply deploy the new code without adding a new “column” to the “table”; when Django encounters an object where the value is missing it simply uses the field’s default value, and will add the value to the DB row on save.
But if you want to query on the default value of that new column then you’ll actually need to populate the value into the table. This applies even if you’re querying for None/NULL; there’s a difference between a column containing None and a column not existing in the row at all. If you want to query a column for None then each row has to explicitly contain None in order to be returned. So sometimes you don’t need to bother with a migration, and sometimes you do.
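The difference between "contains None" and "doesn't exist" can be shown with a toy model. The sketch below assumes, as a simplification, that a query is answered from an index which only has entries for entities that actually store the property; the helper names and the sample entities are hypothetical, and this is not how the Datastore is really implemented.

```python
def build_index(entities, prop):
    """Map each stored value of `prop` to the ids of the entities holding it.

    Entities which don't have the property at all get no index entry.
    """
    index = {}
    for entity in entities:
        if prop in entity:
            index.setdefault(entity[prop], []).append(entity["id"])
    return index


def query_equal(entities, prop, value):
    """Return the ids of entities whose stored `prop` equals `value`."""
    return build_index(entities, prop).get(value, [])


entities = [
    {"id": 1, "title": "old row"},                       # saved before the field existed
    {"id": 2, "title": "new row", "published": None},    # explicitly stores None
    {"id": 3, "title": "another", "published": "2016-01-01"},
]

# Only the entity which explicitly stores None is returned.
before = query_equal(entities, "published", None)

# The "migration": write the default into every entity that lacks it.
for entity in entities:
    entity.setdefault("published", None)

after = query_equal(entities, "published", None)
```

Before the values are populated the query only finds entity 2; afterwards it finds entities 1 and 2.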
Deciding whether to run a migration or not completely depends on your use case, so it’s not something which Djangae can decide for you. As a result, our solution is that when Djangae is told to perform one of Django’s migration operations it simply ignores it, and then we provide a separate set of operations which you can use to create the migrations which you actually want to perform.
However, we don’t discard Django’s migrations, as they also provide the model state history. We just ignore the schema changes which they try to perform.
So now that we’ve defined that division, how do we diffuse the differences between Django’s “done” and “not done” with the Datastore’s “doing data duties”? We need a way to track each individual operation within a migration, and to know whether each one is done, in progress, or yet to be started. Seeing as operations on the Datastore are best performed in background tasks, we essentially made the operation tracker and the task runner into the same thing.
Tracking Migration Operations
The custom operations which Djangae provides are largely similar to the standard Django operations, but where Django’s operations make a call to the schema_editor, the Djangae operations make a call to our task runner.
Here’s what the code for Django’s AddField operation looks like:
And here’s the code for Djangae’s corresponding operation, AddFieldData, which you may or may not want to use when adding a new field:
As you can see, the main behaviour of this is that it makes a unique identifier for this operation, then it asks the task manager if a task for this operation is already running, and if so it waits for it to finish, and if not then it starts the task.
One key here is that the database_forwards method is idempotent; it can be called multiple times and will still only run the task once. This allows us to ignore the fact that Django doesn’t mark the migration as “done” until all of the operations have successfully finished. If Django’s migrate command is run again before this migration is done, then Django will treat the whole migration as “not done”, and will therefore try to run every operation in the migration again. But our task manager is keeping its own record of where it’s got to, so is not fooled by Django’s instruction to repeat things.
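The idempotency described above can be sketched in a few lines. This is not Djangae's code: InMemoryTaskManager and AddFieldDataSketch are hypothetical names, the identifier format is invented, and Django's real database_forwards takes app_label, schema_editor, from_state and to_state arguments which are left out here. The sketch also returns the task's status rather than blocking until it finishes.

```python
class InMemoryTaskManager:
    """Hypothetical stand-in for a task runner which also tracks state."""

    def __init__(self):
        self.status = {}       # identifier -> "running" or "done"
        self.start_count = {}  # identifier -> how many times a task was started

    def start(self, identifier):
        self.status[identifier] = "running"
        self.start_count[identifier] = self.start_count.get(identifier, 0) + 1

    def finish(self, identifier):
        self.status[identifier] = "done"


class AddFieldDataSketch:
    """Toy operation whose forwards step can safely be called repeatedly."""

    def __init__(self, migration_name, model_name, field_name, tasks):
        # A unique identifier for this operation (format is an assumption).
        self.identifier = "%s:%s:%s" % (migration_name, model_name, field_name)
        self.tasks = tasks

    def database_forwards(self):
        state = self.tasks.status.get(self.identifier)
        if state is None:
            # No record of this operation yet, so start its task.
            self.tasks.start(self.identifier)
            state = "running"
        # If the task is already running or done, we never start it again.
        return state


tasks = InMemoryTaskManager()
op = AddFieldDataSketch("0002_add_views", "article", "views", tasks)

first = op.database_forwards()   # starts the task
second = op.database_forwards()  # migrate run again: task is already running
tasks.finish(op.identifier)      # the background task completes
third = op.database_forwards()   # migrate run once more: nothing left to do
```

However many times the operation is asked to run, the task manager's own record means the task is only ever started once.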
Talking of running the migrations, that brings us to our last topic, which is how we trigger these migrations when App Engine doesn’t have any actual “servers” that you can ssh into.
Django expects you to run migrations from the shell, using the django-admin.py migrate command. If you’re running the migrations on your local development database then it all happens on your local machine. And if you want to run the migrations on your staging or production database, then you either need to ssh into the server and run the command there, or set up Django to connect to your remote database from your local machine.
App Engine, however, doesn’t have any servers. It has ethereal “instances” which are automatically created and destroyed by the Google cloud infrastructure in order to run your application. None of them have a fixed IP address, and they don’t support ssh access, so that option is out.
However, what App Engine does give us is a remote API for the Datastore, so we can access our live database from a local machine. But performing large data migration activities over that remote API from a laptop wouldn’t be very cool, because what would normally be a single SQL statement would instead be a separate read and write instruction for every single row, transferring the data between the cloud and your laptop.
Instead, we take advantage of the fact that App Engine’s task queues are actually built upon the Datastore. In other words, the remote Datastore API also provides a remote task queue API. So from the shell on our local machine, we can queue the migration tasks onto the task queue of the actual App Engine application in the cloud. And we can then use the same remote API to check the progress of said tasks.
This gives us the rather nice behaviour that once you’ve triggered a migration from your local machine, you can then kill the migrate command with Ctrl+C, close your laptop lid, drop your laptop in a river, or turn your wifi off. The operation will continue to run in the cloud. You can later run the django-admin.py migrate command again, and as described earlier, it will check on the progress of that operation and either trigger the next operation or mark the migration as completed, as appropriate.
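The overall flow across several runs of the command can be sketched like this. Everything here is a simplification for illustration: migrate_once is a hypothetical function, the task_status dict stands in for state that really lives in the cloud, and each simulated run returns straight away instead of waiting on the task while you stay connected.

```python
def migrate_once(name, operations, task_status, applied):
    """Simulate one invocation of the migrate command for one migration.

    operations:  ordered list of operation identifiers in the migration
    task_status: dict of identifier -> "running" or "done" (the remote state)
    applied:     set of migration names already recorded as applied
    """
    if name in applied:
        return "already applied"
    for op in operations:
        state = task_status.get(op)
        if state == "done":
            continue  # this operation finished on an earlier run
        if state == "running":
            return "waiting on %s" % op
        task_status[op] = "running"  # queue the task in the cloud
        return "started %s" % op
    # Every operation has finished, so the migration can be recorded.
    applied.add(name)
    return "applied"


task_status = {}
applied = set()
ops = ["populate_views", "populate_slug"]  # made-up operation identifiers
log = []

log.append(migrate_once("0002", ops, task_status, applied))
log.append(migrate_once("0002", ops, task_status, applied))

# The task finishes in the cloud while the laptop is closed.
task_status["populate_views"] = "done"
log.append(migrate_once("0002", ops, task_status, applied))

task_status["populate_slug"] = "done"
log.append(migrate_once("0002", ops, task_status, applied))
```

Each run picks up from wherever the remote state says things have got to, so it doesn't matter how many times, or from which machine, the command is invoked.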
The Future’s in the Cloud
Now that we’ve removed the need to have an uninterrupted connection to the database in order to run migrations, it seems that the next logical step is to be able to remove the local terminal from the equation altogether. There’s now seemingly no reason why we can’t create a Django web view, which would allow you to trigger the tasks without touching your terminal at all.