“Efficiency Unleashed: Streamline and Optimize Your Data with Partitioning”

Published in Code Like A Girl · 5 min read · Jun 7, 2023

At school, partitioning makes maths problems with large numbers easier to solve by splitting the numbers into smaller units. A similar principle is applied in the data field to improve scalability: a sizable database is deliberately divided into smaller segments.

The objective of partitioning is to distribute both the data and the query workload evenly across the nodes. Partitioning separates data into distinct groups based on specific values or fields, so it is easy to identify which group a record belongs to, even when the groups are stored on different nodes. As a result, a query that filters on the partitioning field can go straight to the relevant partition instead of performing a full table scan, which speeds up processing.

The key-value data model is a straightforward approach where data is accessed based on its primary key. It can be likened to an old-fashioned paper encyclopedia, where you locate an entry by its title. As the entries are organized alphabetically by title, you can swiftly locate the specific entry you need.

Key-value data can be partitioned in two ways: by key range or by the hash of the key.

Partitioning by key range assigns each partition a continuous range of keys, from a minimum to a maximum, represented by a start key and an end key. For example, key range 1 covers all the names from A to Bayes, excluding Bayes. Keys can be kept in sorted order within each partition, which makes range lookups efficient. The disadvantage of key range partitioning is that some access patterns create hot spots, and this is particularly evident when timestamps are used as keys. In that case the partitions correspond to time ranges, such as one partition per day, so all writes go to the current day’s partition. That partition is overloaded with writes while the others sit idle.

Partitioning by key range (borrowed from “Designing Data-Intensive Applications”)
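The timestamp hot spot described above is easy to reproduce. The following is a minimal sketch, assuming a hypothetical scheme with one partition per calendar day; the function `partition_for` and the simulated write times are made up for illustration and do not come from any particular database.

```python
from collections import Counter
from datetime import datetime, timedelta


def partition_for(ts: datetime) -> str:
    """Hypothetical range scheme: one partition per calendar day."""
    return ts.strftime("%Y-%m-%d")


# Simulate 1,000 writes arriving every 10 seconds, starting at 09:00.
# All of them fall on the same day (the last one is at about 11:46).
start = datetime(2023, 6, 7, 9, 0, 0)
writes = [start + timedelta(seconds=10 * i) for i in range(1000)]

# Count how many writes each partition receives.
load = Counter(partition_for(ts) for ts in writes)
print(load)  # Counter({'2023-06-07': 1000})
```

Every write lands in the partition for the current day, while the partitions for earlier days receive nothing.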

Range partitioning is a good fit when your table has a continuous, ordered key, such as time, and queries often read a range of it. Let’s revisit our basic key-value store, which uses string keys and values. In this scenario, we are storing information about authors and their books. If we have prior knowledge of the distribution of author names, we can designate partition boundaries based on specific letters.

For instance, we can define partition splits at the letters ‘b’ and ‘d’. The start and end of the entire key range also need to be marked explicitly, and we can use an empty string to stand for both the lowest and the highest key. Each range includes its start key and excludes its end key:

[“”, “b”): covers all the names from a up to b, excluding b

[“b”, “d”): covers all the names from b up to d, excluding d

[“d”, “”): covers everything else
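To make the lookup concrete, here is a small sketch of how a key could be mapped to one of these three ranges. It assumes each range includes its start key and excludes its end key, as the descriptions above state. The split points come from the example; the function name `range_partition` and the sample author names are illustrative only.

```python
from bisect import bisect_right

# Split points from the example above. Index 0 is the range before "b",
# index 1 is "b" up to (not including) "d", index 2 is everything else.
SPLIT_POINTS = ["b", "d"]


def range_partition(author: str) -> int:
    """Return the index of the key range that holds this author name."""
    # bisect_right counts how many split points are <= the key, which is
    # the index of the half-open range [start, end) that contains it.
    return bisect_right(SPLIT_POINTS, author.lower())


for name in ["Austen", "Bayes", "Camus", "Dickens", "Zola"]:
    print(name, "->", range_partition(name))
# Austen -> 0
# Bayes -> 1
# Camus -> 1
# Dickens -> 2
# Zola -> 2
```

Because the split points are sorted, finding the right partition is a binary search, and keys that are adjacent in sort order stay in the same or neighboring partitions.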

Many distributed data stores partition by the hash of the key to mitigate skew and hot spots. With a suitable hash function, each partition is assigned a range of hashes instead of a range of keys, and any key whose hash falls within a partition’s range is stored in that partition. Hash partitioning therefore spreads records across partitions in a pseudo-random way rather than grouping neighboring keys together. It suits keys for which ranges are not meaningful, such as product IDs or employee numbers. One drawback is that keys that were once adjacent become scattered across partitions, so the sort order of the keys is lost.
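As a rough sketch of the idea, the snippet below hashes each key and assigns it to the partition whose hash range contains the result. The choice of SHA-256, the 64-bit hash space, the four partitions, and the product IDs are all assumptions made for illustration; real data stores pick their own hash functions and range layouts.

```python
import hashlib

# Assumption: we keep the first 8 bytes of the digest, giving a 64-bit hash.
HASH_SPACE = 2**64


def hash_partition(key: str, num_partitions: int = 4) -> int:
    """Map a key to the partition whose hash range contains hash(key)."""
    # Python's built-in hash() is salted per process for strings by default,
    # so a stable hash from hashlib is used instead.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    # Split the hash space into equal contiguous ranges, one per partition.
    range_size = HASH_SPACE // num_partitions
    # min() guards the top of the space when it does not divide evenly.
    return min(hash_value // range_size, num_partitions - 1)


for product_id in ["P-1001", "P-1002", "P-1003"]:
    print(product_id, "->", hash_partition(product_id))
```

The same key always maps to the same partition, but consecutive IDs give no hint about where their neighbors end up.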

Now, let’s examine a basic database comprising employee records, each containing a name and a corresponding salary. Suppose we partition the database using hash functions, with the employee salary as the key. Due to the nature of hash functions, employees with identical salaries will be stored in the same partition. However, employees with similar salaries may end up in different partitions. The number of partitions employed influences the specific outcome, but the crucial point to note is that records that are considered “close” in terms of the selected key can now potentially be quite distant from each other.
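The salary example can be sketched in a few lines. Everything here is hypothetical: the employee names and salaries, the four partitions, and the use of SHA-256 as the hash function are assumptions chosen only to illustrate the point.

```python
import hashlib

NUM_PARTITIONS = 4  # assumed for illustration


def hash_partition(salary: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a salary to the partition whose hash range contains its hash."""
    digest = hashlib.sha256(str(salary).encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    range_size = 2**64 // num_partitions
    return min(hash_value // range_size, num_partitions - 1)


# Hypothetical employee records: (name, salary).
employees = [("Asha", 50000), ("Bo", 50000), ("Chen", 50001), ("Dara", 50002)]
placement = {name: hash_partition(salary) for name, salary in employees}
print(placement)


def partitions_for_salary_range(low: int, high: int) -> set:
    """Partitions that must be consulted for salaries in [low, high]."""
    return {hash_partition(s) for s in range(low, high + 1)}


print(partitions_for_salary_range(50000, 50010))
```

Asha and Bo always share a partition because the hash is deterministic and their salaries are identical. Chen and Dara, whose salaries differ by only one or two units, may land anywhere, so a query for a salary band usually has to consult several partitions.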

**Hash partitioning** (diagram borrowed from thedigitalcatonline.com)

The division strategies we have previously examined depend on a data model that uses keys and values. However, the scenario becomes more intricate when secondary indexes are included. The challenge of secondary indexes is that they do not align seamlessly with partitions. Two primary methods for dividing a database with secondary indexes are document-based partitioning and term-based partitioning.

We will learn about partitioning on secondary indexes in the next article. It is a detailed topic in itself.

You might be interested in this series, where I’m introducing several important concepts that new Data Engineers should be aware of. The other topics I have covered so far:

Replication lag

Data Replication

Enhanced Query Performance

Indexing

Scalability

Slowly Changing Dimension

Distinctions between CTEs, SubQueries and TempTables

Sharding and Partitioning

Partitioning Data

Thanks for the read. Do clap👏 and follow if you find it useful😊.