All about database sharding | Scaling up the database
Ever wonder how do Facebook, Instagram, Amazon handle their database? A platform like this have billions of user and you can imagine how data is generated in daily basis.
How do they scale up their database? The answer is database sharding. Before we understanding database sharding, let’s understand some type of scaling.
Type of Scaling
Suppose you have a hotel which has 5 floors and with time your hotel’s demand has increased, so you added two more floors in your hotel. It scales up the hotel vertically by adding floors in the existing building.
Similarly based on demand when resources to your server like CPU, RAM, DISK in the existing server you are scaling up in the vertical direction.
Suppose you have a restaurant and with time your restaurant’s demand has increased, so you opened a new branch in the same city. It scales up the restaurant horizontally by adding a new branch in the city.
Similarly based on demand when more servers are added parallelly you are scaling up in the horizontal direction.
What is Database Sharding?
Database sharding is an architecture pattern for horizontal scaling. It suggests making multiple partitions of the database based on a certain aspect. The partition can be two types vertical partition, horizontal partition.
In this partition, we divide the table based on the column. Suppose you have a user table with the column first_name, last_name, favorite_colour. So you divide the table and kept first_name and last_name in one database and favourite_color in another database.
In this partition, we divide the table based on the row. Suppose you have a user table with the column first_name, last_name, favorite_colour. So you divide the table kept the 100 rows in one database and 100 rows in another database.
Advantage & Disadvantage of Sharding
- Spread out the load: As there could be many machines parallelly running. The load can be distributed based on which system it needs to perform the query.
- Speed up query response times: As mentioned as the data will much less be compared to total data that actually might be there which automatically lead to faster querying.
- Highly scalable: Whenever needed you can add a new machine based on demand.
- More reliable: As your data is divided into multiple systems, if due to some reason any system goes you it won’t impact the whole system.
- Resharding is complicated: Once you sharded the data and after some time the demand increase/decrease and you wanted to add/remove one of the servers, then it will be complicated to reshard the data. As it has to rearrange the data based on the condition that was given to shard the data. And also costly to perform the operations.
- Shards eventually become unbalanced: Suppose you sharded the data based on month-wise. It might be possible that on some there might be far more data compared to other months. These could lead to unbalanced data on different devices
- Complicate to merge the shards: Once the database is sharded if difficult to return to unsharded architecture. And also very costly to perform the operations.
- Complicated to perform join operation: In a relational database, we perform join operations to get the relational data. If the data is distributed in multiple servers then it needs to perform join operations from all the server which very complicated.
Database Sharding Architectures
Algorithm/Key Based Sharding
In this approach, you select a column or combination of columns then you pass it into a hash function and you get a hash value. And you have 4 shards so to decide which shard you want to send the data for read or write operation. So to decide it make a modular operation based on that value you can decide on which shard you want to redirect to.
Resharding data can be challenging. It requires updating the sharding function and moving data around the shards. Doing both at the same time while maintaining consistency and availability is hard. Better choice of sharding function can reduce the amount of transferred data. Consistent Hashing is such an algorithm.
In this approach, we set a category for each shard. The strategy is to divide the collections of data into chunks and group them into the same shard. These categories could be anything region, age etc.
One drawback of is approach is that its configuration is stored in a separate database called a lookup table. If it goes down due to some reason then the whole system will be down. To tackle this scenario you can have a replica for backup.
You can use this approach when you want to distribute the data based on some category like region.
Range Based Sharding
This is a strategy to make a shard based on range, which could be the range of dates, primary key or something else. You might think it’s the same as dynamic sharding. It’s similar to that but this lookup table is not required, as these ranges are set in a logical pattern.
One drawback of this approach is that there will imbalance of data. Suppose store some sales data in the database. On festive month there will be more sales data comparative to another month. Which will automatically lead to imbalanced data in shards.
You can use this approach when you want to query the data based on range.
Entity Group Based Sharding
In this approach, we shard the data based on their entity group. Suppose you shard the data based on user data but there might be data that are related to that user. If we keep the relational data in another shard then it will be complex to perform join operations. To overcome this we keep all the data related to that user in the same shard. Which make it simple to perform join operations also optimise the query performance.
This strategy is more suitable for the relational database. It makes it simpler to perform join operations to reduce query time.
No technology is perfect, everything has its own advantage and disadvantages. It’s on your requirement which approach will be more appropriate.
You can also go for the hybrid approach, let’s understand by example. Amazon has a huge infrastructure all over the world. How could be scale the database of Amazon?
The first level will be dynamic sharding first division will be based on the country the second division will be based on the city.
The second level will be based on key-based sharding here all the initial data like user signup data could be managed.
The third level will be entity group based sharding all user-related data will be stored in the same shared.
Making a hybrid approach sometimes also reduce the drawback and increase the advantage of the overall architecture.