Designing Data Intensive Applications — Partitioning

Published in

Geek Culture

10 min readDec 19, 2022

Introduction

Welcome to my 6th article in the series, Designing Data Intensive Applications. This article is highly inspired by one of the best Data Engineering books available out there: Designing Data Intensive Applications by Martin Kleppmann. In this chapter, we’ll be discussing the topic of data partitioning.

Note: I have received no compensation for writing this piece. Please consider supporting my and others’ writing by becoming a Medium member with this link.

Throughout this article, we’ll answer to the three following questions:

Why do we need data partitioning?
How does data partitioning impact the indexes that we built?
How does request routing work in your distributed system to account for partitioning?

Why do we need data partitioning?

In the previous article, we discussed replication — that is, having multiple copies of the same data on different nodes. For very large datasets, or very high query throughput, that is not sufficient: we need to break the data up into partitions, also known as sharding.

The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in our cluster. Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.

Designing Data Intensive Applications — Partitioning

Introduction

Why do we need data partitioning?

Written by Chouaieb Nemri