Learning CosmosDB as a newbie!
When we start learning anything new, the breadth of the subject may lead us to focus on random topics while we miss out on the most important ones. In this article, I share an approach that I applied to get started with learning the important aspects related to CosmosDB.
It’s all about money, honey
Although good documentation is available on the CosmosDB docs pages, I am an impatient learner who tries to get started as quickly as possible and then learns the concepts in depth while working on real applications. Starting with CosmosDB was no different. I began with a quick look at the docs and some video tutorials, but the breadth of the content was overwhelming for a newbie like me. So how did I go about it? I looked at how users get charged for using CosmosDB: if users are spending money on something, it ought to be an important setting.
Using the Capacity Planner to get an overall idea of the important concepts
I headed to the CosmosDB Capacity Planner to identify the important settings that I must focus on first. While this may not be the recommended way, and my understanding may have been incomplete at that point, this approach helped me quickly get an overall idea of CosmosDB. The Capacity Planner is meant to help users identify their throughput requirements, but I used it as a starting point in my CosmosDB learning journey. Below is a snapshot of the Capacity Planner (zoomed out to fit all fields on a single page).
I went through each field and tried to work out its significance. This is how I approached the whole thing:
API: This is a dropdown menu with 3 options, which means that CosmosDB lets users choose which API they want to use. These APIs correspond to popular database technologies, so CosmosDB must be a multi-model system that lets you work with Mongo, Cassandra, SQL, etc. through their native APIs. After looking at this field, I read about all the APIs in CosmosDB.
Number Of Regions: CosmosDB must have a setting to choose multiple regions. Why would someone want to choose multiple regions? Maybe to reduce latency and ensure fault tolerance. This field led me to ask why a database would need to be deployed in multiple regions.
Multi Region Writes: This option confused me. If I have already specified more than 1 in Number of Regions, why do I need to enable or disable Multi Region Writes? I was under the impression that if we choose more than one region in the previous setting, the data is automatically replicated to all those regions, so why enable Multi Region Writes separately? Understanding the need for Multi Region Writes became a non-negotiable requirement.
Consistency Levels: Consistency levels are important in any distributed system, so this setting made sense, as CosmosDB is a globally distributed database service. I still needed to thoroughly understand the difference between the 5 consistency levels offered by CosmosDB and when consistency comes into play.
Indexing Policy: Indexing was not that alien a term. Indexing can speed up queries but can make writes costlier, depending on how many properties get indexed.
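To make the indexing trade-off concrete, here is a minimal sketch of what an indexing policy can look like, written as a Python dictionary that mirrors the JSON shape CosmosDB uses for a container's indexing policy. The property paths (/rawPayload and /debugInfo) are hypothetical and only stand in for properties you write often but never filter on. Which paths to exclude depends entirely on your own queries.

```python
import json

# Sketch of a container indexing policy (same shape as the JSON document).
# "/*" indexes every property; the excluded paths are hypothetical examples
# of properties that are written often but never used in query filters.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [
        {"path": "/rawPayload/*"},
        {"path": "/debugInfo/*"},
    ],
}


def excluded_paths(policy):
    """Return the property paths that are left out of the index."""
    return [entry["path"] for entry in policy.get("excludedPaths", [])]


if __name__ == "__main__":
    print(json.dumps(indexing_policy, indent=2))
    print("Not indexed:", excluded_paths(indexing_policy))
```

Every property that stays under an included path has to be indexed on each write, so trimming the index this way is one lever for reducing write cost.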
Data Stored in Transactional and Analytical Store Use: These fields brought focus to OLTP and OLAP workloads. I had to dig deeper to understand what was so special about CosmosDB offering an Analytical Store. One of the main takeaways was that data becomes available in the Analytical Store (a columnar store) in near real time after it is written, without requiring any additional ETL jobs.
Variable Workload: Some workloads have high-volume, steady traffic, like continuous data collected from IoT sensors, while others have bursty traffic. Understanding the nature of the workload is important for provisioning the required throughput.
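A rough way to see why the shape of the workload matters is to convert operation rates into Request Units per second (RU/s), the unit in which CosmosDB throughput is provisioned. The sketch below is only illustrative: the per-operation RU costs and both hourly traffic profiles are made-up numbers, not measurements or Capacity Planner output.

```python
# Hypothetical RU cost per operation. Real costs depend on item size,
# indexing policy, and query shape, so treat these as placeholders.
RU_PER_OP = {"read": 1.0, "write": 6.0, "query": 10.0}

# Hypothetical traffic profiles: operations per second for four sample hours.
STEADY = [{"read": 100, "write": 200, "query": 5}] * 4
BURSTY = [
    {"read": 10, "write": 5, "query": 1},
    {"read": 10, "write": 5, "query": 1},
    {"read": 500, "write": 100, "query": 20},
    {"read": 10, "write": 5, "query": 1},
]


def required_ru_per_sec(ops_per_sec, ru_per_op):
    """Sum rate * cost over all operation types for one sample."""
    return sum(rate * ru_per_op[op] for op, rate in ops_per_sec.items())


def peak_and_average(profile, ru_per_op=RU_PER_OP):
    """Return (peak RU/s, average RU/s) across the samples in a profile."""
    loads = [required_ru_per_sec(sample, ru_per_op) for sample in profile]
    return max(loads), sum(loads) / len(loads)


if __name__ == "__main__":
    for name, profile in (("steady", STEADY), ("bursty", BURSTY)):
        peak, average = peak_and_average(profile)
        print(f"{name}: peak={peak} RU/s, average={average} RU/s")
```

With these made-up numbers, the steady profile needs the same RU/s all day, while the bursty one peaks far above its average. That gap is what you have to account for when deciding how much throughput to provision.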
Item Size: This setting hints that cost varies with the size of the items on which operations are performed. This was quite obvious, but an even more important factor to understand was the number of properties. The number of properties may become a very significant factor in write costs if automatic indexing is on, as all of the properties will then be indexed. In such cases, the more properties an item has, the higher the write cost.
Point Reads and Queries: Although I had never thought about the distinction between Point Reads and Queries before, it clearly was a major factor in overall cost estimation and hence an important concept to learn about. A Point Read is an exact key-value fetch of a single item using its Partition Key and Item ID. This also led me to read about the Partition Key in CosmosDB, which is probably one of the most important topics to learn about if you plan to use CosmosDB.
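The difference between a point read and a query is easier to see in code. The sketch below uses the method names read_item and query_items from the azure-cosmos Python SDK's container object, but it runs against a small in-memory stand-in (FakeContainer, which is my own invention) so that it works without a CosmosDB account. The item shape, the deviceId partition key, and the city filter are all hypothetical. Check the SDK documentation for the exact parameters before relying on it.

```python
def point_read(container, item_id, partition_key):
    """Fetch exactly one item by its ID and partition key (a point read)."""
    return container.read_item(item=item_id, partition_key=partition_key)


def query_by_city(container, city):
    """Run a filter query; it may touch many items and partitions."""
    return list(
        container.query_items(
            query="SELECT * FROM c WHERE c.city = @city",
            parameters=[{"name": "@city", "value": city}],
            enable_cross_partition_query=True,
        )
    )


class FakeContainer:
    """In-memory stand-in (assumption) mimicking the two container calls."""

    def __init__(self, items):
        # Items are keyed by (partition key, id), like a key-value lookup.
        self._items = {(i["deviceId"], i["id"]): i for i in items}

    def read_item(self, item, partition_key):
        # Direct lookup: no scanning, no filter evaluation.
        return self._items[(partition_key, item)]

    def query_items(self, query, parameters=None,
                    enable_cross_partition_query=None):
        # The fake ignores the SQL text and just applies the @city filter
        # by scanning every stored item.
        city = parameters[0]["value"]
        return (i for i in self._items.values() if i["city"] == city)


if __name__ == "__main__":
    container = FakeContainer([
        {"id": "1", "deviceId": "sensor-a", "city": "Pune"},
        {"id": "2", "deviceId": "sensor-b", "city": "Pune"},
        {"id": "3", "deviceId": "sensor-c", "city": "Delhi"},
    ])
    print(point_read(container, "2", "sensor-b"))
    print(query_by_city(container, "Pune"))
```

The point read needs both the item ID and the partition key up front, which is why choosing a partition key that your application already knows at read time pays off.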
While there are several other important features and topics like continuous backup, restore, data ingestion, throughput allocation, hot partitions, throttling, etc., using the Capacity Planner to figure out what I must learn about a new technology was a new approach for me, and it seems to have worked well. I’ll soon cover more topics in the same fashion and evaluate whether it could be a generic way to get started!