In this article, I will explain the term in distributed computing called Partitioning so I will explain the following points
- What is Partitioning meaning?
- When do I need it?
- Explaining the kinds of Partitioning with advantages and disadvantages
- Secondary index and kinds of Secondary index
- Rebalancing and why I need Rebalancing?
- Different Rebalancing strategies with advantages and disadvantages
- Different ways for clients to know about which node contains target partition ( Request routing)
What is Partitioning meaning?
The term Partitioning simply coming from away that every part of our data such as records is owned by one partition so every partition can be considered a small DB.
When do I need it?
When our data became very large and it can not be stored on a single machine and the query to a large volume of data on a single machine became very hard and the scalability now is important the partitioning can solve this problem as a different partition can be placed on a different node on our cluster and the large volume of data can be distributed on many disks and our queries can be done on different processors as we can scale by adding more machines on our cluster.
Ways of Data Partitioning
As we explained we need to distribute our data on a small unit (Partition)
and to distribute the data and the query load equally across the nodes
so Actually this can be done in different approaches
This kind of partitioning is like having a simple key-value data model and always getting the value by the key so we can partition the data through the primary key but in many cases, this way can not be efficient as a partition key design that doesn’t distribute requests evenly can make some partitions have more data or queries than others and that called skewed and on the worst case all the data and loads wind up on a single only one partition and this is some use case to explain when this kind can be good or bad
- User ID, where the application has many users (Good Case)
- Item creation date rounded to the nearest time period (for example, day, hour, or minute) (Bad Case)
On other hand, we can assign a constant range of keys to every partition from N to N of keys and if we know the borderline between these ranges we can simply know which partition include our key or we can directly ask the node if we already know the which partition assigned to which node.
The ranges of keys do not surely have the same space because it depends on the partition boundaries. for example, if you have volumes and one of them contains words starting with letters A and B and another volume starting with V, X, Y, and Z it will end up with some volumes much bigger than others so if you want to equally distribute your data the partition boundaries need to adapt to the data.
The boundaries might be set manually or the database engine supports to set it automatically.
This strategy is used by HBase, BigTable, RethinkDB, and others.
The advantage of this strategy is you can keep keys sorted that range scans are easy and you can fetch several related records by treating the key as a concatenated index, for example (year-month-day-hour-minute-second).
The disadvantage of this strategy is the particular access patterns can end up with hot spot problems, for example, if the key is a timestamp then the partition correspond to ranges of time, one per day all data for today end up going to the same partition and the partition can be simply overloaded to avoid this issue you can rely on another key or by for example prefixing the timestamp with the machine name in case of sensors data for example.
- Hash of key
As we explained before the issue of the hot spot and skew a lot of distributed data systems use a hash function to determine the partition of a given key.
by the way hash function is any function that can be used to map data of arbitrary size to fixed-size values.
the data is distributed evenly and the skew / hot spot avoiding as much as the hash function is perfect. for example, if you have a 32-bit hash function that takes a string whenever you give it a new string, it returns an apparently random number between 0 and 2³² even if the strings are very similar the hashes are equally distributed across the range of numbers.
You can now assign each partition a range of hashes and every key whose has falls within a partition`s range will be stored in the partition.
regrettably, we lost a good feature of key-range that we can do efficient range queries as the key is now scattered across partitions so the sort order is lost as well for example in MongoDB if we enabled hash-based Partitioning mode any range query has to be sent to all Partitions.
As we explain before the partitioning schemes as we have discussed the key-value and we can only access the record through the primary key and we can determine the partition from this key so we can easily know the record in which partition but what can for example if we have data about car and the car_id is the primary key how we can get the red cars as some of them can be stored in different partitions. a secondary index is used commonly in relational databases and document databases as well but NoSQL key-value databases like HBase avoided it because it added implementation complexity but some stores have started adding it because so very useful for data modeling such as Riak. Actually, the problem with the secondary index is that it does not map cleanly to partitions so we have two kinds of secondary index partitioning document-based and term-based.
Consider we have an online system for selling cars and we are using a primary unique id and we partition our database using this key so for example car IDs from 0 to 500 in partition 0 and so on . now our customer need to filter the black cars or BMW cars for example here you need a scondary index on the color and brand these would be fields in document databases or columns in relational databases. if you have declared the index the database can perform the indexing automaticlly whenever the black car is added to the database the database partation automaticlly adds it to the list of primary keys IDs for the index entry. In this indexing type each partition is entirely separate as each partition maintains its own secondery indexes and it does not care what the data is stored in other partitions you only deal with the partition that contains the primary key ID or doucment ID. so if some of the black cars in partation and others on another partations and if you want to search for black cars you have to send the query to all partitions and combine all the results. this approche is sometimes knows as scatter/gather and it can make read queries on secondary index costly and it widly used in ElasticSearch and cassandraDB and others. most of database vendros recommend thay you structure your parationing sechema so that the seconday index queries can be served from a single paration but this is not always possible mainly when you are using multiple seconday index such as color and brand.
Instead of each partition having its own secondary index we can build a global index that covers the data in all partitions. A global index must also be partitioned individually the primary key index for example if we have black cars from all partitions appear under color:black in the index but the index also partitioned so that colors starting with letters A and K apprear in partition 0 and others in partition 1 and so on we call this type of index term-partitioned. so rather than make request to all partitions with scatter/gather cost now we just know the partition that match our term like car color starting from A and so on but the bottleneck there is the writes aee slower and more complicated because write to a single document may now affect multiple partitions.
I will exaplin the reblancing in the part 2 of the article.
- Design data intensive application
- Introduction to Reliable and Secure Distributed Programming