But I left out an important detail. To use the algorithm I presented, I required a random integer to be assigned to each Datastore entity. That’s a fine approach, but wouldn’t it be better to not require something like that?
So for the next article I thought I’d quickly explain ndb keys, then kick on with code that uses the space of keys as the keyspace and happily shards any existing ndb objects.
And then I got stuck for 3 weeks.
I figured out that I was stuck because this explain-ndb-keys thing is no simple task. Judging from all the documentation I’ve seen so far, they are utterly inexplicable. But to go anywhere interesting we need to know exactly how they work. So, let’s try.
What even is an ndb key?
The point of an ndb key is that every ndb model entity (ie: lump o’ data that you store in the GCP Datastore) has one. It is the unique handle for that model entity in the Datastore.
Unfortunately, an ndb key isn’t just a simple string or integer or whatnot. Instead, it’s a class encapsulating a triple of (application id: string, namespace: string, kind/id pairs: list of pairs of strings or string/int pairs). Even worse, some of that stuff is important and means things.
From the code comments:
The application id must always be part of the key, but since most
applications can only access their own entities, it defaults to the
current application id and you rarely need to worry about it. It must not be empty.
Or in more practical terms, just forget about this. It’s supplied automatically. I’ve never tried changing it, but I’m betting that you can’t.
It does interestingly imply that the Datastore is one monster multitenant system, where our keys are seriously globally unique across it, partitioned by our Application Ids. It’s interesting that we even get to see this detail.
Your application can be split into namespaces, which affects the datastore and maybe memcache? Namespaces are meant for multitenant apps, although you don’t need to use them. I don’t. I find they get in the way when I want to do operations across tenants. But you might use them. In that case, each of your model entities will be in exactly one namespace, and that will be represented in the key. It gets set for you automatically, so you can probably ignore it. In fact, the code comments say this:
The namespace designates a top-level partition of the key space for a particular application. If you've never heard of namespaces, you can safely ignore this feature.
The most important bit of a Key is the kind/id pairs. This is a list of pairs, the first element (kind) being a string, the second (id) being a string or integer. So what are these kinds and ids?
The various Google documentation and code comments say all kinds of well meaning things about kinds and ids; that kinds are class names, that integer ids must be assigned for you, that the various pairs represent a hierarchy of objects, that all the pairs except the last one constitute a parent key…
The documentation is so misleading that I would recommend mostly ignoring it. Instead, let’s strip this right back to the basics.
The kind/id pairs have the following structure:
pairs = [(kind0, id0), (kind1, id1), ..., (kindn, idn)]
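To make the shape concrete, here's a plain-Python sketch of the whole triple. This is NOT the real ndb.Key class, just a model of its structure; SketchKey is a made-up name.

```python
from collections import namedtuple

# A plain-Python sketch of an ndb key's structure (not the real class).
SketchKey = namedtuple("SketchKey", ["app", "namespace", "pairs"])

fred_post_key = SketchKey(
    app="my-app-id",   # supplied automatically in practice
    namespace="",      # empty unless you use namespaces
    pairs=(("People", "fred"), ("Post", "fredpostid")),
)
```

The real ndb.Key exposes the same information via key.app(), key.namespace() and key.pairs().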
Rule 1: There can only be one
That is, you cannot store two entities with the same key (ie: Application Id, Namespace, Pairs) in the Datastore. If you try it, the one stored second will overwrite the one stored first. No errors, no warning, just hulk smash. So be careful of this.
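Here's Rule 1 as a toy dict-backed "datastore" — a sketch of the behaviour, nothing to do with the real implementation:

```python
# A toy "datastore": a second put() with the same key silently
# replaces the first. No errors, no warning, just hulk smash.
store = {}

def put(key, entity):
    store[key] = entity

key = ("my-app-id", "", (("People", "fred"),))
put(key, {"name": "Fred"})
put(key, {"name": "Freddy"})  # overwrites the first entity, silently
```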
Rule 2: (kind0, id0) is the Entity Group
The documentation talks about Entity Groups from time to time, in a way you’ll find yourself tuning out. But, Entity Groups are super important and you really need to understand them.
Let’s take a quick tangent off into what the Datastore actually is.
The Datastore is the public face of something called Megastore. Here’s a paper about Megastore from when dinosaurs roamed the earth. And here’s the abstract:
Megastore is a storage system developed to meet the requirements of today’s interactive online services. Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters. This paper describes Megastore’s semantics and replication algorithm. It also describes our experience supporting a wide range of Google production services built with Megastore.
Both strong consistency & high availability? Well, when you read on you find out that you don’t really get both things at the same time. And when they say fully serializable ACID semantics within fine-grained partitions of data, little might we realise that these fine-grained partitions of data are a thing.
They are Entity Groups.
Here’s a diagram from the paper:
Megastore is pretty amazing. It provides a way to have performant strong consistency across datacenters. You can see it in Figure 1; Entity Groups and Datacenters are orthogonal.
And see the ACID semantics within an entity group and Looser consistency across entity groups comments? We’re getting to the heart of the matter!
In Figure 2, we see a representation of two entity groups. We see that Most transactions are within a single entity group and Cross entity group transactions supported via Two-Phase Commit.
Where am I going? To this:
- Each entity in the Datastore (same thing as the Megastore) is in exactly one Entity Group.
- Entities in the same Entity Group are grouped together for transactional integrity.
- Entities in separate Entity Groups are not.
And most importantly for you, dear architect, is this:
Two model entities whose keys have the same (kind0, id0) pair are in the same Entity Group. Two model entities whose keys have different (kind0, id0) pairs are not.
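That rule is simple enough to sketch in a few lines of plain Python, with keys' pairs as tuples (the helper names here are made up, not ndb API):

```python
# The Entity Group is just the first (kind, id) pair of a key's pairs.
def entity_group(pairs):
    return pairs[0]

def same_entity_group(pairs_a, pairs_b):
    return entity_group(pairs_a) == entity_group(pairs_b)

fred_post = (("People", "fred"), ("Post", "fredpostid"))
fred_photo = (("People", "fred"), ("Photo", "fredphotoid"))
mary_post = (("People", "mary"), ("Post", "marypostid"))
```

Fred's post and Fred's photo share ("People", "fred") as their first pair, so they share an Entity Group; Mary's post does not.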
If two entities are in the same Entity Group, they can be easily worked on in a strongly consistent way. BUT, but, they are super slow to operate on; you’re limited to roughly 1 write / second per Entity Group. Basically your writes are being serialized and applied carefully inside one Entity Group. So you get consistency but not speed and certainly not scale.
If two entities are not in the same Entity Group, they can be written to and worked with concurrently, with eventual consistency, very successfully. BUT, but, to operate strongly consistently, ie: in a transaction, they need an expensive and slow two-phase commit process which is prone to failing due to contention. So you get speed & scale, but at the cost of consistency.
And the kicker is, you have to choose your Entity Groups at create time. When you create an entity, you provide the key, and that determines the Entity Group. Once you’ve put() that entity, that’s it, it’s written in stone. So unlike a relational database, you need to architect your system with transactions vs concurrency in mind, and get it right from day 1.
Something else to keep in mind: when I talk about writes in transactions, and the documentation talks about writes in transactions, we really mean reads and writes. Oh yeah. A transaction cares about what is read as well as what is written, inside the transaction. You can use transactions purely to make sure you are getting a strongly consistent read from the Datastore; you might not even do any writes.
Now, the documentation (and the paper above) will hit you with models like this: you’ve got People, and Posts, in a blogging app. Each Post is owned by a People. You can structure their keys like this:
ndb.Key("People", <id0>, "Post", <id1>)
Fred’s Post’s key:
ndb.Key("People", "fred", "Post", "fredpostid")
btw please don’t use natural keys as your ids; it’ll make your keyspace too lumpy to shard. I use autogenerated ids, usually the string form of uuids, to get a decent spread across the uuid space. This also guarantees uniqueness of any individual id, which isn’t necessary but is often really useful.
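Concretely, id generation for me looks like this (new_id is a made-up helper name):

```python
import uuid

# The string form of a uuid4 spreads keys evenly across the keyspace
# and is unique on its own into the bargain.
def new_id():
    return str(uuid.uuid4())

# Hypothetical usage when building a key:
# ndb.Key("People", new_id())
```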
This structure is fine, but it includes all the nasty implications about keys from the docs; that the (kind, id) pairs are somehow hierarchical, that keys have parents, that kinds are class names.
But your model mightn’t work that way. A person-and-their-stuff based hierarchy like this might not give you integrity vs concurrency in the right places. It’s worth a good hard think.
Also note that two keys with only one pair, like Fred’s key above, will never be in the same Entity Group. And, you can choose not to create a key for a model entity, in which case it will be generated for you, and will only have one pair. So, if you don’t choose your own keys, then all of your model entities will be in separate Entity Groups on their lonesome. Great for concurrency, but bad big league when you start wanting transactions. Sad!
Before we move on to Rule 3, let’s talk about Ancestor Queries.
Ancestor Queries are about Entity Groups
You’ll see stuff in the docs about Ancestor Queries. These let you query entities with a given key prefix; that is, with the same Application Id, Namespace, and where your Ancestor Key’s pairs are a prefix of (or the same as) the entities’ Key’s pairs.
eg: If you use ndb.Key("People", "fred") as the ancestor in an ancestor query, you’ll match both ndb.Key("People", "fred") and ndb.Key("People", "fred", "Post", "fredpostid").
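An ancestor match is just a prefix match on the key's pairs. Here's a sketch with plain tuples and a made-up helper name (same Application Id and Namespace assumed, and nothing to do with the real query machinery):

```python
# A key matches an ancestor if the ancestor's pairs are a prefix of
# (or equal to) the key's pairs.
def matches_ancestor(ancestor_pairs, key_pairs):
    return key_pairs[:len(ancestor_pairs)] == ancestor_pairs

ancestor = (("People", "fred"),)
fred = (("People", "fred"),)
fred_post = (("People", "fred"), ("Post", "fredpostid"))
mary = (("People", "mary"),)
```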
You’ll also notice that transactions require ancestor queries; no normal queries. This is a huge limitation, why is it so?
It’s because an Ancestor Key always contains at least one pair (ie: (kind0, id0)), and so describes the Entity Group (check your tattoo if you’ve forgotten why). By requiring an Ancestor, ndb is forcing you to specify a single Entity Group for your query, which is actually crucial in a transaction.
All the other pairs in the ancestor besides (kind0, id0) are unnecessary; you could use other fields as criteria instead. But that first pair, critical.
So, think about where you’ll need transactions before you create your data, and about how you’ll use Ancestor queries.
Rule 3: kindn is probably your class name. All other kinds are just strings.
Generally the kinds in your (kind, id) pairs are *not class names*, no matter what the docs imply. They are just strings.
kindn is special. Datastore/Megastore doesn’t know about this (it doesn’t know anything about your classes), but the Python ndb library uses kindn to determine which class to construct when you get an entity from storage.
If we look in the code for Key, we can see how get() works. It calls get_async(), which includes this line of code:
cls = model.Model._kind_map.get(self.kind())
That’s the Key class using kind() (which returns kindn) to look up a class in something called _kind_map in the ndb Model class.
We know the keys in _kind_map must roughly be class names, with the classes as values. If we look in model.Model, we can see how _kind_map gets populated. Various places call this method:
"""Update the kind map to include this class."""
cls._kind_map[cls._get_kind()] = cls
See that the class method _get_kind() is called to get the kind name to use as the map key.
"""Return the kind name for this class.
This defaults to cls.__name__; users may override this to give a
class a different on-disk name than its class name.
"""
As it says on the tin, it returns the class name, but you can override it if you want your class to use something different. My guess is that the only time you want to do this is when you want to change the name of an existing class in your codebase, but there’s also existing data and you don’t want to upgrade that. You would override _get_kind() to return the old class name.
kindn is the name of the class to use when loading your entity from the Datastore, *unless* a model class has overridden _get_kind() to return kindn, in which case ndb uses that class. If two model classes clash, I don’t know, one will win and end up in the _kind_map, and the other will be unreachable. So probably don’t do that.
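Here's a simplified sketch of the whole kind-map mechanism. In real ndb a metaclass registers every model subclass automatically; here we register by hand, and class_for_kind is a made-up name for the lookup that get_async() does.

```python
# Simplified sketch of ndb's kind map (not the real implementation).
_kind_map = {}

class Model:
    @classmethod
    def _get_kind(cls):
        # Default: the kind is just the class name.
        return cls.__name__

    @classmethod
    def _update_kind_map(cls):
        """Update the kind map to include this class."""
        _kind_map[cls._get_kind()] = cls

class Post(Model):
    pass

class Person(Model):
    # A renamed class keeping its old on-disk kind name:
    @classmethod
    def _get_kind(cls):
        return "People"

Post._update_kind_map()
Person._update_kind_map()

# Loading an entity: look up the class via the key's last kind.
def class_for_kind(kindn):
    return _kind_map.get(kindn)
```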
And this bears repeating: the other kinds, kind0 through kindn-1, are just strings. They are just there to be monstrously confusing as far as I can tell (bar kind0, which we need for the Entity Group). The only thing you have to worry about with these strings is that they help form a unique key, as per Rule 1.
Rule 4: idn can be supplied by the datastore.
If you leave out idn or set it to None, the datastore will provide it. I think the Datastore does this, not the Python client library, although I could be wrong, I haven’t looked.
It doesn’t matter; all you need to know is that you’ll get a 32 bit integer, and that it’s not globally unique, but it will be chosen so as to make your key globally unique (ie: conform to Rule 1).
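A sketch of that idea (the real allocation happens server-side, as I said, and this isn't it; the point is only that the chosen integer needn't be globally unique, it just has to make the whole key unique):

```python
import random

# Keep picking 32-bit integers until the resulting key doesn't clash
# with an existing one (Rule 1). Helper name and shapes are made up.
def allocate_id(existing_keys, partial_pairs, last_kind):
    while True:
        candidate = random.getrandbits(32)
        key = partial_pairs + ((last_kind, candidate),)
        if key not in existing_keys:
            return candidate

existing = {(("People", "fred"), ("Post", 12345))}
post_id = allocate_id(existing, (("People", "fred"),), "Post")
```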
If you choose to provide idn, you must provide a string.
id0 through idn-1, these are for you to provide. They can be strings or integers, it’s up to you. The documentation will tell you that you cannot provide an integer, but that’s only true for idn; for id0 through idn-1, you do what you like.
So we’ve seen that an ndb key looks like this:
(Application Id, Namespace, Pairs = [(kind0, id0), (kind1, id1), ..., (kindn, idn)])
We know that we can ignore Application Id, and we can ignore Namespace (unless we’re using namespaces).
We know that (kind0, id0) is the Entity Group, that we care about this, and why we care.
We know that kindn must be our class name, or whatever _get_kind() returns if we overrode that.
We know that idn must be a string if we provide it, or can be left out, in which case the Datastore will choose an integer for us.
We know that all kinds must be strings, and all ids must be strings or integers.
And that’s it. Those are the rules of keys. Don’t believe any silliness about parents and hierarchies.
What’s next: What’s next
All that background on keys is useful. But something you might not expect about keys is that we can ask “What’s next”.
ie: ndb Keys are ordinal. They have order.
That fact is vitally important if we want to use the space of Keys to shard over. But that’s for next time.
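As a teaser, here's the idea with keys modelled as plain (app, namespace, pairs) tuples; Python tuple comparison stands in for the real key ordering, which is an assumption for illustration only:

```python
# Because a key is just a triple ending in a list of pairs, keys have a
# natural order: compare component by component, left to right.
keys = [
    ("app", "", (("People", "mary"),)),
    ("app", "", (("People", "fred"), ("Post", "b"))),
    ("app", "", (("People", "fred"),)),
]
ordered = sorted(keys)
# A key sorts before any key it is an ancestor of, and all of Fred's
# keys sort together, before all of Mary's.
```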