Explained: Database Sharding

Sparsh Gupta
Nerd For Tech
Published in
7 min readNov 8, 2020

what do you understand from the term database sharding ??

Well, The word “Shard” means “a small part of a whole“. Hence Sharding means dividing a larger part into smaller parts. That means dividing the database into small databases. Well, actually yes but how ?? the answer is below.

Database sharding is the process of splitting up a database across multiple machines to improve the scalability of an application. In Sharding, one’s data is broken into two or more smaller chunks, called logical shards. The logical shards are then distributed across separate database nodes, referred to as physical shards.

Need for Sharding:

Consider a very large database whose sharding has not been done. For example, let’s take a DataBase of a college in which all the student’s record (present and past) in the whole college are maintained in a single database. So, it would contain very very large number of data, say 100, 000 records.

Now when we need to find a student from this Database, each time around 100, 000 transactions has to be done to find the student, which is very very costly.

Now consider the same college students records, divided into smaller data shards based on years. Now each data shard will have around 1000–5000 students records only. So not only the database became much more manageable, but also the transaction cost of each time also reduces by a huge factor, which is achieved by Sharding.

Hence this is why Sharding is needed.

Types of Sharding :

1. Key based Sharding

Key based sharding, also known as hash based sharding, involves using a value taken from newly written data — such as a customer’s ID number, a client application’s IP address, a ZIP code, etc. — and plugging it into a hash function to determine which shard the data should go to. A hash function is a function that takes as input a piece of data (for example, a customer email) and outputs a discrete value, known as a hash value. In the case of sharding, the hash value is a shard ID used to determine which shard the incoming data will be stored on. Altogether, the process looks like this:

2. Range Based Sharding

Range based sharding involves sharding data based on ranges of a given value. To illustrate, let’s say you have a database that stores information about all the products within a retailer’s catalog. You could create a few different shards and divvy up each products’ information based on which price range they fall into, like this:

3. Directory Based Sharding

To implement directory based sharding, one must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. In a nutshell, a lookup table is a table that holds a static set of information about where specific data can be found. The following diagram shows a simplistic example of directory based sharding:

Directory based sharding is flexible as compared to range-based sharding and key-based sharding. Range-based sharding limits you to specifying range of values, while key-based sharding limits you to using fixed hash-based function.

Benefits of Sharding :

The main appeal of sharding a database is that it can help to facilitate horizontal scaling, also known as scaling out.

  • Smaller Databases are Easier to Manage. Production databases must be fully managed for regular backups, database optimization and other common tasks. With a single large database these routine tasks can be very difficult to accomplish, if only in terms of the time window required for completion. By using the sharding approach, each individual “shard” can be maintained independently, providing a far more manageable scenario, performing such maintenance tasks in parallel.
  • Smaller Databases are Faster. The scalability of sharding is apparent and achieved through the distribution of processing across multiple shards and servers in the network. What is less apparent is the fact that each individual shard database will outperform a single large database due to its smaller size. By hosting each shard database on its own server, the ratio between memory and data on disk is properly balanced, thereby reducing disk I/O and maximizing system resources. This results in less contention, greater join performance, faster index searches and fewer database locks. Therefore, not only can a sharded system scale to new levels of capacity, individual transaction performance is benefited as well.
  • Database Sharding can Reduce Costs. Most database sharding implementations take advantage of low-cost open source databases and commodity databases. The technique can also take full advantage of reasonably priced “workgroup” versions of many commercial databases. Sharding works well with commodity multi-core server hardware, systems that are far less expensive when compared to high-end, multi-CPU servers and expensive storage area networks (SANs). The overall reduction in cost due to savings in license fees, software maintenance and hardware investment is substantial-in some cases 70% when compared to traditional solutions.

Drawbacks of Sharding :

  1. Adds complexity in the system: Properly implementing a sharded database architecture is a complex task. If not done correctly, there is a significant risk that the sharding process can lead to lost data or corrupted tables. Sharding also have major impact on your team’s workflows.
  2. Rebalancing data: In a sharded database architecture, sometimes a shard outgrows other shards and becomes unbalanced, which is also known as database hotspot. In this case any benefits of sharding the database is canceled out. The database would be likely need to be re-sharded to allow for a more even data distribution.
  3. Joining data from multiple shards: To implement some complex functionalities we may need to pull lot of data from different sources spread across multiple shards. We can’t issue a query and get data from multiple shards. We need to issue multiple queries to different shards, get all the responses and merge them.
  4. No Native Support: Sharding is not natively supported by every database engine. Because of this, sharding often requires a “roll your own”. This means that documentation for sharding or tips for troubleshooting problems are often difficult to find.

The main question Should you implement it on your System ?

Well , maybe or maybe not it depends on lot factor, Some see sharding as an inevitable outcome for databases that reach a certain size, while others see it as a headache that should be avoided unless it’s absolutely necessary, due to the operational complexity that sharding adds.

Because of this added complexity, sharding is usually only performed when dealing with very large amounts of data. Here are some common scenarios where it may be beneficial to shard a database:

1. The amount of application data grows to exceed the storage capacity of a single database node.

2. The volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.

3. The network bandwidth required by the application outpaces the bandwidth available to a single database node and any read replicas, resulting in slowed response times or timeouts.

Before sharding, you should exhaust all other options for optimizing your database. Some optimizations you might want to consider include:

  1. Setting up a remote database.
  2. Implementing caching
  3. Creating one or more replicas
  4. Upgrading to a large server

Many big tech company uses this method for their Distributed system and many of them innovate this method to next extent part for eg. Google,facebook,Amazon & etc.

Google Spanner and HBase — Range Sharding

This type of sharding allows efficiently querying a range of rows by the primary key values. Examples of such a query is to look up all keys that lie between a lower bound and an upper bound.

You will saw many big tech companies using sharding architecture in their system designs.

However,Sharding can be a great solution for those looking to scale their database horizontally. However, it also adds a great deal of complexity and creates more potential failure points for your application. Sharding may be necessary for some, but the time and resources needed to create and maintain a sharded architecture could outweigh the benefits for others.

Further Reading:

and great explanation of data sharding by the Guru of system design a.k.a Gaurav Sen

I hope you Like it my article , I try to explain the database sharding in most minimal manner . I have started the series called “ Explained: System design” series for explaining the system designing in software development .

I hope you have good see you next Time !!

And please do follow me on Social media :

for twitter : https://twitter.com/Sparsh94749562

for Instagram : https://www.instagram.com/sparsh.gupta06/

--

--