We all know that cross-partition scans are inefficient and expensive, but what exactly happens under the hood? To answer that question, I proxied my application's traffic through Fiddler.
I had to switch to HTTPS mode, which performs worse than Direct TCP mode, so that Fiddler could capture all of the network traffic. One of the response headers indicates which physical partition the request was executed against. After many retries with different queries on a collection that has 10 physical partitions (for simplicity I left MaxDegreeOfParallelism unset), the traces show that the SDK always starts searching at partition 0. If the result doesn't satisfy the query, it sends another request to partition 1, and so on, until it has found everything or reached the last partition. This means RU consumption will never be evenly distributed across partitions: partition 0 gets hit the most, partition 1 the second most, and so on. The cost depends on where the data lives. If it is all in partition 0, I am lucky (the most cost-effective case, although the distribution is still uneven), but if it is in partition 9, the SDK has to send 10 requests in total to find it, which costs roughly 20–30 RUs.
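To make the skew concrete, here is a minimal Python sketch of the sequential behavior described above. It is a toy model, not SDK code: the function names `sequential_scan` and `hit_counts` are made up for illustration, and it assumes the simplest case where each query's data sits in exactly one partition.

```python
from collections import Counter


def sequential_scan(partition_count, data_partition):
    """Return the partitions probed, in order, until the one holding the data is reached.

    Toy model of the observed behavior: start at partition 0 and move on
    one partition at a time. Assumes the data lives in a single partition.
    """
    probed = []
    for partition_id in range(partition_count):
        probed.append(partition_id)
        if partition_id == data_partition:
            break
    return probed


def hit_counts(partition_count, data_partitions):
    """Count how often each partition is probed across a series of queries."""
    hits = Counter()
    for target in data_partitions:
        hits.update(sequential_scan(partition_count, target))
    return [hits[p] for p in range(partition_count)]


if __name__ == "__main__":
    # Data in partition 9 of 10: every partition is probed once.
    print(len(sequential_scan(10, 9)))  # 10
    # One query per partition: partition 0 is probed by all of them.
    print(hit_counts(10, range(10)))  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Even when the data itself is spread perfectly evenly across the 10 partitions, the probes are not: in this model partition 0 receives ten times the traffic of partition 9.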
So what can go wrong with an uneven distribution? It is not cost-effective, because the RU allowance pre-configured for Azure Cosmos DB is a total, which means each partition gets an equal share of it according to a simple formula:
RU consumption per partition = Total RU consumption / Number of partitions
Let’s say you pre-configured the above collection with 50K RUs. Each partition is then capped at 50K / 10 = 5K. Even though partition 9 barely gets hit, it still holds a 5K share of the allowance, most of which goes to waste, while the most-hit partition 0 will most likely be overloaded under peak load and return a 429 (Too Many Requests) error. However, it might not be as bad as it looks: the SDK has a built-in retry mechanism that retries the throttled query internally, so your application's consumers might not notice any difference even though your Azure Cosmos DB metrics show 429 errors.
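The arithmetic can be sketched in a few lines of Python. The function names and the per-partition demand figures below are hypothetical, chosen only to show how a skewed load can hit the per-partition cap while the collection as a whole is far below its total allowance.

```python
def per_partition_ru(total_ru, partition_count):
    """RU allowance per partition = total RU allowance / number of partitions."""
    return total_ru / partition_count


def throttled_partitions(total_ru, demand_per_partition):
    """Return the ids of partitions whose demand exceeds their equal share."""
    cap = per_partition_ru(total_ru, len(demand_per_partition))
    return [i for i, demand in enumerate(demand_per_partition) if demand > cap]


if __name__ == "__main__":
    total_ru = 50_000
    # Hypothetical skewed demand: partition 0 is hit the most, partition 9 the least.
    demand = [9_000, 6_000, 4_000, 2_000, 1_000, 500, 200, 100, 50, 10]

    print(per_partition_ru(total_ru, len(demand)))   # 5000.0
    print(sum(demand))                               # 22860
    print(throttled_partitions(total_ru, demand))    # [0, 1]
```

In this made-up scenario the total demand is less than half of the 50K allowance, yet partitions 0 and 1 would still be throttled because each is limited to its own 5K share.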
If you have MaxDegreeOfParallelism enabled (set to a positive number), the SDK no longer checks one partition at a time. Instead, it prefetches by sending x requests at a time (x being the value of the property) to all partitions, regardless of where your data is located, and aggregates the responses internally. All of this happens within a single call to the ExecuteNextAsync() method, which is a bit confusing, so I raised an issue on GitHub for clarification.
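As a rough illustration of the prefetch behavior, the following Python sketch groups partitions into waves of at most x requests. This is only a toy model of what I observed, with a hypothetical function name `parallel_fanout`; the SDK's real scheduling may well differ.

```python
def parallel_fanout(partition_count, max_degree_of_parallelism):
    """Group partition ids into waves of at most `max_degree_of_parallelism` requests.

    Toy model: every partition is queried regardless of where the data is,
    so the total number of requests always equals the partition count.
    """
    if max_degree_of_parallelism < 1:
        raise ValueError("max_degree_of_parallelism must be a positive number")
    partition_ids = list(range(partition_count))
    return [
        partition_ids[start:start + max_degree_of_parallelism]
        for start in range(0, partition_count, max_degree_of_parallelism)
    ]


if __name__ == "__main__":
    waves = parallel_fanout(10, 4)
    print(waves)                        # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
    print(sum(len(w) for w in waves))   # 10
```

In this model a higher degree of parallelism reduces the number of waves, but never the number of requests: all 10 partitions are hit every time, even if the data sits entirely in partition 0.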
The above explains why cross-partition scans are expensive and why you should avoid them as much as possible. But can we avoid them? In my humble opinion it can be challenging. It requires carefully picking your partition key and designing your application to work with it. If your application is an API, it will likely rely on its consumers to pass the partition key to you, so it can become a cooperative design challenge.