Data Partitioning

Bhaveshanandpara
4 min readMay 1, 2024

--

Hey everyone!

In this blog, we’ll explore data partitioning. This post is part of the System Design for Beginners Series.

If you’re completely new to this, my suggestion is not to overwhelm yourself with too much information. Understand what you can and move ahead ( let the future you suffer 😉 )

Data Partitioning

Data Partitioning means partitioning the database into parts 🤷‍♂️

Now, why should we partition the database ?

Improved Scalability

Think about it: a database is basically memory, which is based on some hardware, so as the data scales, we also have to scale the hardware.

Maybe increase 100GB storage to 500GB 😮‍💨( Vertical Scaling ) But scaling up is costlier.

So if we partition our database into small partitions, we can scale by adding more cheap databases. (Horizontal Scaling)

Improved Availability

If you have all of your data in a single database and that fails, your database will be unavailable. ( Single point of failure )

So if you divide your database into small partitions, even if one of them fails, the other databases will still be up and running.

So partitioning data will make database more available.

Improved Performance

Suppose your database has 1 million records and now you want to get a single record. It may go and process all 1 million records.

But in data partitioning, if we have divided the database into 10 partitions and somehow we know that we will find this record in the 4th partition (we will see how), we will save a lot of computation time, resulting in improved performance.

Vertical Partitioning ( Normalization )

We divide table based on columns.

In the above example, we had a table with following attributes:

Roll No., Name, Exam, English Marks, and Math Marks

But then we realized something: what if we partition this table into two tables? We club together relevant attributes like,

  • Exam, English, and math marks are related to the exam,.
  • Name are related to general information about the student.

Roll No. is a primary key in both tables.

Suppose we only need subject-mark data once a year, perhaps during the creation of a marksheet. In contrast, we often query data for general student information.

This approach allows us to enhance performance by primarily querying the smaller student table, which contains general information.

Horizontal Partitioning ( Sharding )

We divide table based on rows.

Range Based Sharding

If you have a large number of records and are using Roll No. as your primary key, you can efficiently partition the table based on the range of Roll No.

We will partition in such a way that each partition will have 50 rows.

Now, if I want to retrieve the data for Roll no. 54, I know it is located in partition 2.

Directory Based Sharding

We divide rows based on some Partition Key

Assume you have partitioned your database based on class.

All the data for Class 1 will be in shard 1, and for Class 2, it will be in shard 2. You are using Class attribute as your Partition Key.

Now, if new record is inserted into the database, it will send the record to the shard based on class value.

So now you have divided your database based on class, requiring less computation for querying.

One problem that may arise is that what if one shard has much more records than the other. It will create an imbalance so choosing a balanced partition key is crucial.

That’s all for today, folks. Catch up with this series later for more updates on System Design Basics from your host and dost — Bhavesh

If you found this helpful, feel free to like it and share your thoughts in the comments! ✨

Next up, we’ll dive into REST APIs

--

--

Bhaveshanandpara

I'm new to the tech world! I'm currently learning DevOps. Having just graduated, I'm working on creating resources related to DevOps for students and freshers.