Partitioning data on Cosmos DB

Chris Wood
Published in All Geek To Me · 10 min read · Mar 11, 2023


When I first started using Cosmos as the data storage platform for my C# applications, life was pretty straightforward:

  • a container could contain 10GB of data without partitioning, or theoretically limitless data with partitioning;
  • a non-partitioned container was affordable to a small business like ours, and a partitioned database was eye-wateringly expensive owing to the minimum Request Unit (RU) levels.

Decision made — we wouldn’t partition data! We knew we’d never get near 10GB of data given the limited scope of our first few applications, which also wouldn’t be public-facing so wouldn’t receive heavy usage.

Then things changed.

The option to have a non-partitioned container went away, and the cost of partitioning became affordable — even compelling! Eventually, the maximum size of data you could store in a logical partition grew to 20GB too.

For some of our existing, really simple applications, we were quite lazy in converting them into partitioned containers. We created a new property on our documents called partition and set the value on every single record to DefaultPartition. This was quick and easy to do, since neither the likely data size in gigabytes nor the levels of usage had changed; they were always going to be simple, small, low-usage systems.

But we were also starting to design larger systems by that point that could potentially store tens of gigabytes of data and generate quite a bit more traffic. Thus began our quest to try and figure out how to partition Cosmos data effectively; and, well, it wasn’t straightforward.

Quick reminder: how partitioning works

There are two types of partition: logical and physical.

In choosing a partition key, you determine which logical partition your data sits within. The Cosmos platform then bundles various logical partitions together (or not) on physical storage partitions. You have no control over how Azure does that and don’t even need to care about it, provided you’re partitioning effectively at a logical level.
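
To make that concrete, here's a minimal sketch using the Microsoft.Azure.Cosmos .NET SDK (the connection details, database and container names are placeholders): you declare the partition key path once, when the container is created, and every document then supplies a value at that path.

using Microsoft.Azure.Cosmos;

// (Inside an async context.)
CosmosClient client = new CosmosClient("<account-endpoint>", "<account-key>");
Database database = await client.CreateDatabaseIfNotExistsAsync("AppDatabase");

// Each document's value at "/partition" decides its logical partition;
// Azure decides which physical partition hosts each logical one.
Container container = await database.CreateContainerIfNotExistsAsync(
    new ContainerProperties(id: "Documents", partitionKeyPath: "/partition"),
    throughput: 400);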

So we know we want to partition our data. How do we choose what to partition by?

Problem #1: overly-simplistic examples

Of course, the first thing we did was Google the heck out of the question. What we found at the time was lots of examples that did genuinely fulfil the main requirements of a good partition key:

  1. the value should have a wide range of possible values (high cardinality);
  2. the value should allow usage (i.e. RU consumption) to be spread as evenly as possible across logical partitions without creating “hot” partitions;
  3. the partition value shouldn’t change.

But all the examples were deliberately simplistic (for very understandable reasons).

Microsoft guidance said, “you could partition by City!”.

Microsoft example: partitioning by city

My reaction was “but more people live in big cities, so you’d end up with a London hot partition whereas Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch would hardly ever be touched”. So that meets requirement number 1, but not 2 or 3 (since people can move). And, well, we weren’t dealing with towns or cities anyway.

Even today, the Microsoft guidance still uses cities, beef products, vegetables and “Soup, Sauces, and Gravies” as partition value examples.

Microsoft: beef products and gravy.

Problem #2: forgetting that you’re going to want to use partition key values to actually find records

A lot of web-based articles seem to be quite academic about partition key choice. They’ll break down data into partitions by the type of record — a gravy or a beef product, for example — or some other factor that is purely based on the data schema in front of you. Logically, the value that is ultimately chosen as the partition key in these examples makes sense, but only in the abstract.

What they don’t always consider, however, is the purpose of the application and the business context in which it sits.

For example, say you’ve got a document that records flight information. Each instance of an actual flight taken will have its own document.

{
  "id": "135b85d9-dc78-4780-a573-91bba00b7fca",
  "flight_code": "VS045",
  "airline_code": "VS",
  "scheduled_departure": "2023-03-11T14:56:00.000Z",
  "actual_departure": "2023-03-11T15:04:42.523Z",
  "departure_airport": "LHR",
  "arrival_airport": "JFK",
  "etc": "etc"
}

Based purely on the data attributes, you might say “we’ll partition the data by the airline code”. However, there aren’t that many airlines and you run the risk of exceeding 20GB of flight data for an individual airline in due course.

So maybe we’ll partition by flight code. Based on nothing more than the schema and the above criteria for a good partition key, that choice will tick all the boxes: it supports a wide range of values, you’re spreading the RU load effectively and the value won’t change. Job done, right?

Now let’s add some business context: the application’s primary use-case is to list all flights from every airline broken down by departure date.

Suddenly, your partitioning choice is not ideal: you’re always going to have to perform “expensive” cross-partition queries to pull together the data for the system’s primary use-case. A cross-partition query is one where you don’t supply the partition key in your query (which would otherwise allow a cheap “point query”), so every partition has to be scanned to find the data.
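
Here’s a sketch of that cost difference in C# (assuming a Flight class matching the document above and a container partitioned by /flight_code):

// Cheap: a point read, because we can supply both the id and the partition key.
ItemResponse<Flight> flight = await container.ReadItemAsync<Flight>(
    id: "135b85d9-dc78-4780-a573-91bba00b7fca",
    partitionKey: new PartitionKey("VS045"));

// Expensive: the primary use-case filters by departure date, which isn't the
// partition key, so Cosmos has to fan the query out across every partition.
QueryDefinition query = new QueryDefinition(
        "SELECT * FROM c WHERE c.scheduled_departure >= @from AND c.scheduled_departure < @to")
    .WithParameter("@from", "2023-03-11T00:00:00.000Z")
    .WithParameter("@to", "2023-03-12T00:00:00.000Z");
using FeedIterator<Flight> results = container.GetItemQueryIterator<Flight>(query);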

Which brings us back to problem #1: in which business system would you actually, ever, look up customers by city, primarily? Or by gravy? You just wouldn’t.

Problem #3: data is often abstract and has no “natural” partition key

Closely related to problem #1, most examples use records representing familiar or real-world objects. This makes total sense as they provide an easily-understood, common item that we can apply to our very abstract world of data storage.

However, not every document will be so lucky as to have an easily identifiable partition key or even identifier value. How would you come up with an identifier or partition key for a fostering application, or for a trade of shares on the market?

In my experience, the norm has always been that we’re modelling abstract concepts in an abstract world of data storage: life isn’t as neat as the examples we see. We don’t have an immediately usable natural identifier or partition value to use that suits the business context, so we do the next best thing: generate unique identifiers randomly.

Problem #4: documents are often related, key propagation requires code overhead, and sometimes all you have is an identifier

This was one of my biggest mental blocks. In my head, I didn’t want to partition documents by a commonality that was ultimately irrelevant to how the data would actually be used (e.g. by document type).

Instead, it made sense (in our business context) to partition by families of documents — for example, the customer, their contracts, their points of contact and their bills should all sit in the same customer family and hence the same logical partition. Additionally, a customer has a contract for an electricity meter point. The meter point family includes documents about location information, serial numbers and other technical information, readings for each meter, etc.

The boundaries of the different families of document often intersect like circles on a Venn diagram:

Families of document types and how they intersect

One partition key option for the customer family of documents could’ve been a customer reference. This would’ve meant storing that value on every type of customer-oriented document. While this is certainly do-able, rippling down the same values between parent and child documents requires overhead in the application code.

It would also mean knowing the customer reference each time we looked up a record by its identifier. Given our application would be fronted by a RESTful API with a URL like:

/api/v1/contracts/{contract-id}

that customer reference partition key value wasn’t going to be available for inexpensive lookups unless all our APIs exposed composite IDs to the outside world, such as:

/api/v1/contracts/{customer-id}:{contract-id}
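
Had we gone down that route, handling a request would have meant unpacking the composite ID, something like this sketch (the Contract type and the sample ID are illustrative):

// A hypothetical composite id received from the route.
string compositeId = "C12345:135b85d9-dc78-4780-a573-91bba00b7fca";

// Split "{customer-id}:{contract-id}" into its two halves.
string[] parts = compositeId.Split(':', 2);
string customerId = parts[0];  // the partition key value
string contractId = parts[1];  // the document id

// The customer reference doubles as the partition key, enabling a point read.
ItemResponse<Contract> contract = await container.ReadItemAsync<Contract>(
    contractId, new PartitionKey(customerId));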

Further, how do you handle the situation where two different families of document intersect? You might want to look up the meter points associated with a customer (a customer’s tenancies), and all the customers associated with a meter point (a meter point’s tenants). One partition key can’t serve both families at once.

OK, enough with the problems! On to the solutions we came up with for the usage challenges in our business contexts.

Solution #1: introducing the PUID

You’ve heard of a GUID — a Globally Unique Identifier — also known as a UUID. We came up with an identifier known as a PUID — a Partitioned Unique Identifier. Here’s an example of one:

a2720172_54d7_43d0_9e9a_fb9fcc4a7c4e

You’d be forgiven for thinking that’s just a GUID, but if you look very carefully, you’ll see it uses underscores instead of dashes. This is the give-away that we’re looking at a PUID and not a GUID.

By itself, it’s not much to look at, but look at the identifiers we’ve used for a family of documents:

a2720172_54d7_43d0_9e9a_fb9fcc4a7c4e - customer
a2720087_068f_4e3e_a294_e57e792e95f1 - contract
a2720b6c_1af8_47d0_97d1_e18b2759149f - customer bill 1
a27202e3_e2dd_4faa_b3ca_133adf8c7542 - customer bill 2
a272054c_dc18_4008_a37f_103b874ef63a - customer bill 3

All the identifiers are different but have the same first five characters. This does mean that generated PUIDs have a higher chance of collision than GUIDs, but we’re not really bothered about generating truly globally unique identifiers. It’s enough for the identifier to be unique within our business application.

As head of the family, the customer’s identifier is generated first and it’s completely random, just like a GUID would be. Child documents then inherit the first five characters of their parent’s identifier in their own identifiers.

The partition key for this particular family is, therefore:

a2720
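
Here’s a sketch of how PUIDs like these can be generated (the function names are purely illustrative; any equivalent would do):

using System;

// The head of a family gets a completely random PUID, just like a GUID.
static string NewFamilyPuid() =>
    Guid.NewGuid().ToString().Replace('-', '_');

// A child keeps its parent's first five characters (the partition key)
// and randomises the rest.
static string NewChildPuid(string parentPuid) =>
    parentPuid.Substring(0, 5) +
    Guid.NewGuid().ToString().Replace('-', '_').Substring(5);

// Usage:
string customerId = NewFamilyPuid();            // e.g. a2720172_54d7_43d0_9e9a_fb9fcc4a7c4e
string contractId = NewChildPuid(customerId);   // e.g. a2720087_068f_4e3e_a294_e57e792e95f1
string partition  = customerId.Substring(0, 5); // "a2720" for every document in the family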

Using this approach means that:

  • if we’re trying to find the parent or child document of another document, we know exactly which partition it will be in — the same one as the document we’re already looking at;
  • if all we have is an identifier, we know exactly which partition it will be in from the first five characters of the identifier (see the point-read sketch after this list);
  • we’ve got high cardinality;
  • we can spread RU usage quite evenly across different logical partitions;
  • the partition key value doesn’t change;
  • any individual partition key can hold up to 20GB of data.
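
That second point is the real payoff: given nothing but an identifier, a cheap point read is always possible (the Contract type here is illustrative):

string id = "a2720087_068f_4e3e_a294_e57e792e95f1";

// The partition key travels inside the identifier, so no cross-partition
// scan is ever needed to fetch a document by id.
ItemResponse<Contract> contract = await container.ReadItemAsync<Contract>(
    id, new PartitionKey(id.Substring(0, 5)));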

Solution #2: dealing with intersecting families

Going back to that real scenario we deal with in our own business domain where two different families intersect — customers and meter points — there are a couple of ways we could’ve dealt with this.

If we’re talking relationally, we’re dealing with a many-to-many relationship representing tenancies. One customer can have multiple meter points, but a meter point can have multiple customers over time (since customers move between properties).

This is solved by the customer’s contract storing references to the meter points used by the customer, and by the meter point storing references to all the customers associated with it. In short, a two-way mapping.

At its simplest, this can be accomplished by the contract document having an array of meter point identifiers and the meter point document having an array of customer identifiers.
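
For example, the contract document might carry the meter point references directly (a sketch; the meterPointIds property name is illustrative):

{
  "id": "a2720087_068f_4e3e_a294_e57e792e95f1",
  "partition": "a2720",
  "meterPointIds": [
    "ef63bb6c_1af8_47d0_97d1_e18b2759149f"
  ],
  "etc": "etc"
}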

Alternatively, if there’s any danger that the array of values ever gets big enough to make your document size approach Cosmos’ maximum supported size of 2MB, you can create “pointer” documents. A pointer for a customer related to a meter point might be as short and sweet as this:

{
  "id": "ef63b54c_dc18_4008_a37f_103b874ef63a",
  "meterPointId": "ef63bb6c_1af8_47d0_97d1_e18b2759149f",
  "customerId": "a2720172_54d7_43d0_9e9a_fb9fcc4a7c4e",
  "partition": "ef63b"
}

Getting a list of customers for a meter point is then a two-step process — get the customer IDs from all the pointer documents related to the meter point, then load the customer documents — but on an RU-consumption basis it can be less expensive than repeated cross-partition scans for business operations that happen very frequently.
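
A sketch of that two-step lookup (CustomerPointer and Customer are illustrative types whose serialised properties match the pointer document above):

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

async Task<List<Customer>> GetCustomersForMeterPoint(Container container, string meterPointId)
{
    // Step 1: a single-partition query within the meter point's family
    // collects the customer IDs from the pointer documents.
    QueryDefinition pointerQuery = new QueryDefinition(
            "SELECT c.customerId FROM c WHERE c.meterPointId = @meterPointId")
        .WithParameter("@meterPointId", meterPointId);

    var customerIds = new List<string>();
    using (FeedIterator<CustomerPointer> pointers = container.GetItemQueryIterator<CustomerPointer>(
        pointerQuery,
        requestOptions: new QueryRequestOptions
        {
            // The pointer lives in the meter point's partition,
            // derived from the first five characters of its PUID.
            PartitionKey = new PartitionKey(meterPointId.Substring(0, 5))
        }))
    {
        while (pointers.HasMoreResults)
            foreach (CustomerPointer pointer in await pointers.ReadNextAsync())
                customerIds.Add(pointer.customerId);
    }

    // Step 2: point reads for each customer, again deriving each partition
    // key from the first five characters of the customer's PUID.
    var customers = new List<Customer>();
    foreach (string customerId in customerIds)
        customers.Add(await container.ReadItemAsync<Customer>(
            customerId, new PartitionKey(customerId.Substring(0, 5))));

    return customers;
}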

Solution #3: accept that not everything needs to be a tightly-focused Cosmos query

Sometimes you just can’t do something without looking at all partitions, or it’s not worth the investment of time or effort to make it more efficient because it happens infrequently, runs as an out-of-hours background job, and doesn’t hold up end-users. That’s OK. The world will not end, and you certainly shouldn’t sacrifice yourself on the altar of purism.

Also, sometimes the answer isn’t a Cosmos query at all. We index a subset of our documents using Azure Cognitive Search where we’ve got little ability to predict the criteria or scenarios users will be encountering when exploring our data.

One partitioning size doesn’t fit all

As I’ve emphasised above, business context and data usage scenarios play a massive role in how you design your partitioned data storage. As such, what’s worked for us may not be a good fit for you.

Going back to the flight tracking example above, it might be that partitioning by the date of departure would be a much better idea than anything I’ve proposed above regarding PUIDs. It all depends on what you’re doing.

Similarly, Microsoft’s own documentation does remind you that you could just use the document’s id as the partition value for very simple scenarios. For me, those would be scenarios that don’t involve logical one-to-many relationships between documents.

An alternative to pointer documents is to store duplicate copies of a document against two different partitions, though that too requires overhead in the application code to keep the copies consistent.

Nevertheless, I hope all this lived experience will help others in navigating their way through the partitioning maze for the first time, and offers a few techniques that prove useful!
