How to select a Partition Key in Azure Cosmos DB? Partition Key Strategies?

Chia Li Yun
Published in Javarevisited
Nov 12, 2021

In this article, we will discuss the factors to consider when choosing a partition key. If you are still new to Azure Cosmos DB, I have shared another article covering 5 key things to know about it.


General Guidelines

Let’s go through some of the principles of a good partition key.

1. Immutable Property

The partition key should be a property whose value does not change. Once a property is your partition key, you can’t update that property’s value on an existing item.

2. High Cardinality

The property should have a wide range of distinct possible values. This helps to keep the logical partitions as small as possible, which allows request unit (RU) consumption to be spread more widely.

3. Even Spread of Data Across Logical Partitions

Similar to point 2, this helps to ensure even RU consumption and storage distribution across the physical partitions.
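To make points 2 and 3 concrete, here is a minimal sketch that profiles a candidate property on a sample of your data: how many distinct values it has, and what share of the items falls into the largest value. The function name `profile_partition_key` and the sample fields `userId` and `country` are hypothetical and only for illustration.

```python
from collections import Counter


def profile_partition_key(items, key):
    """Return (distinct_values, largest_share) for a candidate partition key."""
    counts = Counter(item[key] for item in items)
    if not counts:
        return 0, 0.0
    total = sum(counts.values())
    return len(counts), max(counts.values()) / total


# Hypothetical sample: 8 items, every user is unique, but most live in one country.
sample = [{"userId": f"u{i}", "country": "SG" if i < 6 else "MY"} for i in range(8)]

print(profile_partition_key(sample, "userId"))   # (8, 0.125)
print(profile_partition_key(sample, "country"))  # (2, 0.75)
```

A property with few distinct values or a large share in one value (like `country` here) is a weaker candidate than one that spreads items thinly (like `userId`).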

Other checkpoints to look out for

Does Not Cause Logical Partition to Exceed 20GB

Each logical partition is only allowed to grow up to 20GB.
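As a rough back-of-the-envelope check, you can estimate how long a single partition key value would take to reach the 20GB limit. The sketch below assumes a constant item size and write rate and ignores index overhead and any expiry of old data; the function name and the example numbers are hypothetical.

```python
LOGICAL_PARTITION_LIMIT_GB = 20  # logical partition limit mentioned above


def days_until_full(item_size_kb, items_per_day, limit_gb=LOGICAL_PARTITION_LIMIT_GB):
    """Days until one partition key value accumulates limit_gb of data."""
    kb_per_day = item_size_kb * items_per_day
    limit_kb = limit_gb * 1024 * 1024
    return limit_kb / kb_per_day


# Hypothetical device writing one 1 KB reading per second under a single key value.
days = days_until_full(item_size_kb=1, items_per_day=24 * 60 * 60)
print(round(days, 1))  # 242.7
```

If the estimate is shorter than the time you intend to keep the data, that property alone is probably not a safe partition key.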

Transactions Are Scoped to a Single Logical Partition

If you need multi-item ACID transactions in Azure Cosmos DB, you will need to use stored procedures or triggers. All JavaScript-based stored procedures and triggers are scoped to a single logical partition.

Hence, if you use, say, the itemId as the partition key, every item lands in its own logical partition, so a single stored procedure or trigger call cannot bulk import many items at once. Bulk import is still possible, though: you may check out the Bulk Executor library, which is available for both .NET and Java.

Provisioned Throughput is Evenly Distributed among Physical Partitions

The throughput of a container is evenly distributed among the physical partitions. If your partition key does not distribute the data and requests evenly across the physical partitions, some partitions may become “hot” and rate-limiting (i.e. throttling) may occur. This is important if you expect your total data size to require multiple physical partitions.

A “hot” partition refers to having too many requests directed at a small subset of partitions.
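The arithmetic behind a “hot” partition can be sketched as follows. The container throughput, the partition names, and the demand figures are made-up numbers; the only rule taken from the text above is that provisioned throughput is split evenly across physical partitions.

```python
def throttled_partitions(container_ru, demand_by_partition):
    """Return the partitions whose demand exceeds their even share of the RU/s."""
    budget = container_ru / len(demand_by_partition)
    return {name: demand for name, demand in demand_by_partition.items()
            if demand > budget}


# Hypothetical container: 10,000 RU/s over 4 physical partitions = 2,500 RU/s each.
demand = {"P1": 4000, "P2": 500, "P3": 500, "P4": 500}
hot = throttled_partitions(10_000, demand)
print(hot)  # {'P1': 4000}
```

Note that the total demand here (5,500 RU/s) is well below the provisioned 10,000 RU/s, yet P1 would still be rate-limited because it can only use its own share.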

Does your partition key give you the smallest possible logical partitions? With smaller logical partitions, you can avoid needing an extra physical partition, across which your provisioned throughput would then have to be shared. The diagram below shows how many physical partitions are required, depending on the size of the logical partitions, for an example with a total of 100 GB of data.
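The 100 GB example can be approximated with a small packing sketch. It assumes a physical partition holds up to 50 GB (a figure not stated in this article) and uses a simple first-fit strategy; the real service places data by hashing the partition key, so treat the result as a best-case illustration rather than what Cosmos DB will actually do.

```python
def physical_partitions_needed(logical_sizes_gb, physical_capacity_gb=50):
    """First-fit estimate; a logical partition never spans physical partitions."""
    free_space = []  # remaining GB in each physical partition opened so far
    for size in sorted(logical_sizes_gb, reverse=True):
        for i, free in enumerate(free_space):
            if size <= free:
                free_space[i] = free - size
                break
        else:
            # No existing physical partition has room, so open a new one.
            free_space.append(physical_capacity_gb - size)
    return len(free_space)


# The same 100 GB of data, split two different ways.
print(physical_partitions_needed([20] * 5))   # 3
print(physical_partitions_needed([1] * 100))  # 2
```

With five 20 GB logical partitions, only two fit in each physical partition, so a third physical partition is needed and the provisioned throughput is split three ways instead of two.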

Any Cross Partition Queries?

It is important to identify the queries that your application requires. If you have a huge amount of data, it is recommended that you avoid cross partition queries as much as possible, because in such scenarios they will result in a high RU cost.
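To illustrate why cross partition queries cost more, here is a toy in-memory model. It is not the Cosmos DB SDK: `ToyContainer` and its methods are hypothetical, and it counts logical partitions scanned, whereas the real service fans out across physical partitions. The point is only that a query carrying the partition key value is routed to one place, while a query without it must visit everything.

```python
from collections import defaultdict


class ToyContainer:
    """In-memory stand-in for a container; NOT the Cosmos DB SDK."""

    def __init__(self, partition_key_field):
        self.partition_key_field = partition_key_field
        self.partitions = defaultdict(list)  # key value -> items

    def insert(self, item):
        self.partitions[item[self.partition_key_field]].append(item)

    def query(self, predicate, partition_key_value=None):
        """Return (matching items, number of partitions scanned)."""
        if partition_key_value is not None:
            targets = [partition_key_value]  # routed to a single partition
        else:
            targets = list(self.partitions)  # fan out to every partition
        hits = [item for pk in targets
                for item in self.partitions.get(pk, []) if predicate(item)]
        return hits, len(targets)


container = ToyContainer("deviceId")
for device, temp in [("dev-0", 20), ("dev-0", 35), ("dev-1", 31),
                     ("dev-1", 25), ("dev-2", 40), ("dev-2", 10)]:
    container.insert({"deviceId": device, "temp": temp})


def is_hot(item):
    return item["temp"] > 30


single_hits, single_scanned = container.query(is_hot, partition_key_value="dev-1")
cross_hits, cross_scanned = container.query(is_hot)
print(single_scanned, cross_scanned)  # 1 3
```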

Read / Write Heavy System

Overall, depending on the type of workload you have, you may focus on different conditions:

  • Read heavy system
    - Partition key exists in all, if not most, queries (to prevent cross partition queries)
    - Distribute concurrent queries across partitions
  • Write heavy system
    - Distribute concurrent writes across a wide range of values

Examples

Let’s compare two partition key choices using an IoT scenario.

  1. By IoTId
  2. By DayOfWeek

[Figure: example of the data, partitioned by IoTId (left) and by DayOfWeek (right)]

On the left, it appears that each IoT device has roughly the same amount of data. This works well if your queries are always by IoTId, since each query can go straight to the respective partition. However, you need a plan if you foresee the data of any single device exceeding the maximum logical partition size (20GB).

On the right, using the day of the week, there might be issues if certain days happen to have more data (as shown in the 1st bar). This results in an uneven spread of data and hence the “hot” partitions that we do not want.
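There is a second issue with DayOfWeek that a quick simulation makes visible: on any given day, every write goes to the same logical partition. The sketch below uses made-up readings (5 devices, one reading per hour on a single Monday); the field names `iotId` and `timestamp` are assumptions for illustration.

```python
from datetime import datetime, timedelta


def partitions_written(readings, key_fn):
    """Return the set of partition key values that receive writes."""
    return {key_fn(reading) for reading in readings}


start = datetime(2021, 11, 8)  # a Monday
readings = [
    {"iotId": f"device-{d}", "timestamp": start + timedelta(hours=h)}
    for d in range(5)
    for h in range(24)
]

by_iot_id = partitions_written(readings, lambda r: r["iotId"])
by_day_of_week = partitions_written(readings, lambda r: r["timestamp"].weekday())

print(len(by_iot_id), len(by_day_of_week))  # 5 1

# DayOfWeek has only 7 possible values, so at 20 GB per logical partition
# the whole container could never hold more than 7 * 20 = 140 GB.
max_container_gb = 7 * 20
```

Partitioning by IoTId spreads the day's writes over as many logical partitions as there are devices, while DayOfWeek funnels them all into one.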

None of My Properties Is Suitable as a Partition Key on Its Own

You may look into using a synthetic partition key, which combines several properties to form the partition key.

All you have to do is add another property to each item that holds the combined value, e.g. /partitionKey = itemId + month
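A minimal sketch of building such a property before writing an item is shown below. The helper name `with_synthetic_key`, the `-` separator, and the sample values are assumptions; the `partitionKey`, `itemId`, and `month` names follow the example above.

```python
def with_synthetic_key(item):
    """Return a copy of the item with a partitionKey built from itemId and month."""
    enriched = dict(item)
    enriched["partitionKey"] = f"{item['itemId']}-{item['month']}"
    return enriched


reading = {"itemId": "sensor-42", "month": "2021-11", "temp": 28.5}
doc = with_synthetic_key(reading)
print(doc["partitionKey"])  # sensor-42-2021-11
```

Your application has to compute the same combined value when it reads, so that queries can still target a single partition.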

Conclusion

The above covers most of the factors that one should consider when deciding on a partition key. You may not find a perfect partition key that fulfils all of them, but as long as most of them are met (based on your application requirements, e.g. your most commonly used queries), it should give you the “best” performance.

Good luck and have fun with your journey in Azure Cosmos DB! 😁

Thank you for reading!
