Features and pitfalls of Azure Cosmos DB

Yuriy Ivon
Jan 6, 2020


There are many database engines on the market today striving to provide high performance, scalability, and availability while keeping maintenance complexity and costs as low as possible. Cosmos DB is an excellent example of a database engine that can easily provide all these qualities. This article briefly describes most of its features along with some limitations, which may not be evident at first glance, but may cause tough problems in the future if not taken into account during initial system design.

Further in the article, some facts will be marked with special icons:

👍 Good features which may be very useful in practice.

⚠️ Limitations worth knowing about; even if you only discover them later in the development lifecycle, the problem can be fixed relatively easily.

❌ Limitations that may appear to be a showstopper if they weren’t taken into account during the design phase.

Cosmos DB supports several APIs: SQL, Cassandra, MongoDB, Gremlin, and Table. Here “SQL” stands for the document API that used to be called DocumentDB, and it differs significantly from regular SQL databases. This article focuses on experience with the SQL API. Please note that you have to choose which API to use when creating a storage instance; there is no way to access the same instance through different APIs.

The Cosmos DB document engine stores data in containers (formerly called “collections”); a single element in a container is called an item (formerly a “document”). All scalability, throughput, and indexing settings are specified at the container level (with some exceptions we will talk about later). A database is just a named “umbrella” for a set of containers.

And here goes the first limitation:

❌ The maximum size of an item is 2 MB. This shouldn’t be a problem in most cases, but be aware that the limit exists and that storing media files inside items is not a good idea. More details on resource limits can be found in the official documentation.
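A minimal client-side guard against the 2 MB limit might look like the sketch below, assuming the limit applies to the serialized JSON size; the helper name is illustrative.

```javascript
// Guard against Cosmos DB's 2 MB item size limit before attempting a write.
const MAX_ITEM_BYTES = 2 * 1024 * 1024;

function fitsItemSizeLimit(item) {
  // Cosmos DB stores items as JSON, so the serialized length is a
  // reasonable approximation of the size the service will account for.
  return Buffer.byteLength(JSON.stringify(item), "utf8") <= MAX_ITEM_BYTES;
}
```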

Scaling on-the-fly

Usually, with database engines, you control performance limits via the underlying hardware or, in the case of cloud solutions, via aggregated metrics like DTUs in Azure SQL. In both cases, performance control is not granular enough, and there is a big chance of over-provisioning some resources. Cosmos DB allows you to set the performance metric (throughput) individually for each container.

Throughput of a container is measured in Request Units per second (RU/sec), a relative unit. The baseline of 1 request unit corresponds to a simple point read (GET by ID) of a 1 KB item; a write of such an item costs roughly 5 RU. For example, the estimated throughput for a container expected to handle 500 reads and 100 writes per second of ~1 KB records is 500 * 1 + 100 * 5 = 1000 RU/sec. In practice, however, it is nearly impossible to predict the required throughput precisely, since queries and record sizes vary significantly.
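The arithmetic above can be captured as a back-of-the-envelope helper. The default unit costs (1 RU per ~1 KB read, 5 RU per ~1 KB write) are the rough figures from the example, not exact values for every workload.

```javascript
// Rough RU/sec estimate from expected read and write rates.
function estimateRuPerSec(readsPerSec, writesPerSec, readCost = 1, writeCost = 5) {
  return readsPerSec * readCost + writesPerSec * writeCost;
}
```

`estimateRuPerSec(500, 100)` reproduces the 1000 RU/sec figure from the example.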

The default minimum throughput for a container is 400 RU/sec (~25 USD/month).

👍 The request unit setting is configured on a container level and can be changed at any time without any additional deployment procedures.

👍 There is an option to configure throughput on a database level (in this case, all containers share the same throughput capacity).

👍 There is an autopilot mode, currently in preview, which automatically and instantly scales the provisioned throughput based on your application’s needs. However, I would be very cautious with it, since a poorly optimized application may end up costing a lot of money.

⚠️ If a database exceeds the provisioned throughput while running a query, it throttles incoming requests by immediately responding with HTTP status 429 (“Request rate too large”).
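A sketch of client-side handling for such 429 responses is shown below. The error shape (`code`, `retryAfterInMs`) mirrors what the Cosmos SDKs surface but should be treated as illustrative; note also that the official SDKs already perform a configurable number of retries on throttled requests.

```javascript
// Retry an async operation when it is throttled with HTTP 429, honouring
// the retry-after hint the service returns.
async function withThrottleRetry(operation, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (err.code !== 429 || attempt >= maxRetries) throw err;
      const delayMs = err.retryAfterInMs ?? 1000; // fall back to a fixed delay
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```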

❌ To switch between container-provisioned and database-provisioned throughput, the database must be re-created; there is no seamless way to do this.

Partitioning to achieve endless scalability

Cosmos DB used to offer two kinds of containers: partitioned and non-partitioned. Non-partitioned containers were limited to 10,000 RU/sec in throughput and 10 GB in size. The non-partitioned option is no longer available, so upon creating a new container, you have to specify a partition key.

It is essential to understand the distinction between logical and physical partitions: a logical partition consists of the set of items that share the same partition key, while a physical partition is a “processing unit” of Cosmos DB that can host multiple logical partitions. Depending on data size and its distribution across partition keys, the engine may create new physical partitions and redistribute logical ones across them.

❌ All provisioned throughput is evenly distributed across physical partitions. A container may start with one or more physical partitions. Imagine there was only one at the beginning. When its data size hits the 10 GB limit, the engine splits the partition. While initially queries for any data could consume the entire provisioned throughput (e.g., 1000 RU/sec), after the split all queries hitting the same partition are limited to half of the original value (e.g., 500 RU/sec).

❌ Even though the number of partitions is not limited, the maximum size of a physical partition is 10 GB and its maximum throughput is 10,000 RU/sec, so the partition key must be carefully selected to avoid hitting these limits. Please note that there can’t be more physical partitions than distinct values of the partition key field.
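The partition math above can be sketched with two small helpers; the constants reflect the 10 GB per-partition limit mentioned in the text, and the helper names are illustrative.

```javascript
// Per-partition limits discussed above.
const MAX_PARTITION_GB = 10;

// Minimum number of physical partitions a data set of a given size forces.
function minPhysicalPartitions(dataSizeGb) {
  return Math.max(1, Math.ceil(dataSizeGb / MAX_PARTITION_GB));
}

// Throughput budget left for queries that hit a single physical partition,
// since provisioned RU/sec is split evenly across partitions.
function ruPerPartition(provisionedRu, physicalPartitions) {
  return provisionedRu / physicalPartitions;
}
```

For example, 25 GB of data needs at least 3 physical partitions, and a 1000 RU/sec container split into two partitions leaves only 500 RU/sec per partition.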

Auto-indexing

Instead of creating indices for dedicated fields or combinations of fields, Cosmos DB allows you to set up an indexing policy for paths inside an object. A policy is a set of attributes: which property paths to include in indexing, which to exclude, what types of indices to use, etc.

The interesting thing is that, unlike many other engines, Cosmos DB uses an inverted index instead of a classical B-tree, which makes it very efficient at matching records by multiple search terms and removes the need for composite indices when looking up by multiple fields. You can find more details on the internals of the indexing mechanism in the corresponding paper at http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf.

👍 Indexing policy can be changed on-the-fly.

👍 There are two indexing modes to choose from: consistent and lazy. Lazy mode makes writes quicker but hurts consistency between reads and writes, because indexing is performed in the background after the write completes; while the index is being updated, queries may return incorrect results.

👍 You can create composite indices to speed up ORDER BY over multiple fields, but they are of no use in any other case.

👍 Spatial indices are supported.

⚠️ The default “index everything” policy may cause high RU consumption when persisting items with many fields.
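A custom policy can mitigate this by excluding paths that are never queried. The sketch below follows the Cosmos DB indexing policy format; the excluded path `/rawPayload/*` is a hypothetical bulky, never-queried property.

```javascript
// Keep the default "index everything" behaviour, but exclude one path
// to cut write RU cost.
const indexingPolicy = {
  indexingMode: "consistent",
  automatic: true,
  includedPaths: [{ path: "/*" }],
  excludedPaths: [{ path: "/rawPayload/*" }],
};
```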

Change feed

Change feed support in Cosmos DB works by listening to a container for changes and outputting the list of changed documents in the order they were modified. The changes are persisted, can be processed asynchronously and incrementally, and the output can be distributed across one or more consumers for parallel processing. This makes the feature extremely useful for various integration scenarios.

⚠️ If you use the low-level API to read data from the feed, don’t forget to handle partition splits.

❌ The change feed does not track deletes, so it is better to use “soft delete” to be able to detect removed records.
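A soft delete might look like the sketch below: instead of deleting, the item is flagged so the change feed carries the flagged version downstream. The field names (`isDeleted`, `deletedAt`, `ttl`) are illustrative; setting `ttl` additionally lets Cosmos DB purge the item later if TTL is enabled on the container.

```javascript
// Mark an item as deleted instead of removing it, so the deletion is
// visible in the change feed.
function softDelete(item, purgeAfterSeconds = 7 * 24 * 3600) {
  return {
    ...item,
    isDeleted: true,
    deletedAt: new Date().toISOString(),
    ttl: purgeAfterSeconds, // let the service purge the tombstone later
  };
}
```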

Stored procedures and transactions

Cosmos DB supports JavaScript stored procedures and triggers.

⚠️ JavaScript IO is purely asynchronous, so the code involves heavy callback usage and won’t be as readable as traditional SQL stored procedures. The async/await feature is not supported yet.
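The callback style can be sketched with a minimal stored procedure. `getContext()`, `getCollection()`, and `createDocument()` are part of the server-side JavaScript API; the error handling and procedure name here are illustrative.

```javascript
// Minimal stored procedure: create a document and return its id.
function createDocSproc(doc) {
  var context = getContext();
  var collection = context.getCollection();
  var response = context.getResponse();

  var accepted = collection.createDocument(
    collection.getSelfLink(),
    doc,
    function (err, created) {
      if (err) throw new Error("Create failed: " + err.message);
      response.setBody(created.id); // result returned to the caller
    }
  );
  // false means the request was not queued, typically due to RU pressure
  if (!accepted) throw new Error("createDocument was not accepted");
}
```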

⚠️ There is no way to return additional error details from a stored procedure other than including them in the thrown error message and parsing it on the client.

⚠️ Transactions work either inside stored procedures and triggers or as transactional batches available in Cosmos .NET SDK 3.4 and later.

❌ Only documents from the same logical partition can participate in a single transaction. Consequently, writes to different containers can’t be a part of the same transaction.

❌ Execution time for a stored procedure is limited (5 seconds); if the transaction exceeds this limit, it is rolled back. There is a way to run long-running transactions across multiple calls with continuation tokens, but it violates atomicity.

Querying

Though Microsoft calls the document API “SQL”, it only looks like SQL; there are many differences and limitations.

👍 There are API options that allow you to prevent expensive queries from being executed: EnableCrossPartitionQuery and EnableScanInQuery (see FeedOptions in the documentation). A cross-partition query occurs when the “where” condition doesn’t pin a specific partition key value; a scan occurs when the criteria imply a search by non-indexed fields. Setting both to false is a good safeguard against excessive RU consumption, although in some cases cross-partition queries may be genuinely needed.
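The safeguard could be expressed as a plain options object; the property names below mirror the FeedOptions shape of the JavaScript SDK, but treat the exact casing and shape as illustrative.

```javascript
// Fail fast instead of silently paying for expensive queries.
const safeFeedOptions = {
  enableCrossPartitionQuery: false, // reject queries without a partition key filter
  enableScanInQuery: false, // reject queries that would scan non-indexed fields
};
```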

👍 GROUP BY is now supported (was introduced in November 2019).

⚠️ The aggregate functions MIN and MAX do not seem to use indices.

⚠️ The JOIN keyword exists in the language, but it is used to “unfold” nested arrays within a document; there is no way to join different documents.

⚠️ Since Cosmos DB is schema-less, a field may not exist at all, so every query dealing with potentially non-existent fields should use the IS_DEFINED function; just checking for NULL may not be enough.
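The pitfall is that an explicitly null field and a missing field are different states, which is exactly the distinction IS_DEFINED captures in Cosmos SQL. The local snippet below mirrors it in plain JavaScript; the field names are illustrative.

```javascript
// An item where the field is present but null vs. one where it is absent.
const explicitNull = { id: "1", discount: null };
const missingField = { id: "2" };

// Local analogue of IS_DEFINED: does the document carry the field at all?
const hasField = (doc, field) =>
  Object.prototype.hasOwnProperty.call(doc, field);
```

A query filtering only on `c.discount != null` would treat both items the same way, while IS_DEFINED separates them.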

You can find more hints on querying in the cheat sheets published by Microsoft.

Other useful features

  • Time-to-live, which can be specified as a default for every item in a container or set by the application for each item individually.
  • Five consistency levels: strong, bounded staleness, session (default), consistent prefix, eventual.
  • Unique keys (note that changes in unique keys require re-creating the collection)
  • Optimistic concurrency checks — every item has a special “_etag” field maintained by the database engine, so that application code can force the engine to check that the item being updated has the same “_etag” value as in the data supplied for the update.
  • Geo-replication
  • Encryption at rest
  • Multi-region writes, which allow you to scale out write operations. Bear in mind that multi-mastership always implies possible write conflicts: the same record updated by two instances of the database simultaneously, multiple records with the same primary key added by multiple instances, etc. Fortunately, Cosmos DB provides two options to handle conflicts: automatic (last write wins) and custom, where you can implement your own conflict resolution algorithm.
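The optimistic concurrency check from the list above can be illustrated with a toy in-memory store: a replace only succeeds when the caller presents the `_etag` it last read. Cosmos DB itself performs this check server-side via an If-Match header and returns HTTP 412 on mismatch; the class below is purely illustrative.

```javascript
// Toy in-memory model of the "_etag" concurrency check.
class EtagStore {
  constructor() {
    this.items = new Map();
  }
  // Store an item and stamp it with a fresh etag, as the engine would.
  upsert(item) {
    const stored = { ...item, _etag: String(Math.random()) };
    this.items.set(item.id, stored);
    return stored;
  }
  // Replace succeeds only if the caller's etag matches the stored one.
  replace(item, ifMatchEtag) {
    const current = this.items.get(item.id);
    if (!current || current._etag !== ifMatchEtag) {
      throw new Error("412: precondition failed (stale _etag)");
    }
    return this.upsert(item);
  }
}
```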

As you can see from the above, Azure Cosmos DB has a lot of merits, which makes it a good choice for a variety of software systems. But nothing is perfect, and some limitations might turn out to be showstoppers in certain cases: if your system requires transactions spanning multiple containers or logical partitions, requires long-running transactions involving hundreds or thousands of objects, or if your data cannot be partitioned efficiently enough to keep each partition under 10 GB, and so on. If none of the limitations mentioned above is an issue for your system, Azure Cosmos DB is worth trying.

Special thanks to Illia Lubenets for his help with finalizing the article.
