Key-Based Sharding | Implementation in Java

GANESH SHAH
3 min readMar 21, 2024

What is Sharding ??

Sharding is a type of database partitioning that separates large databases into smaller, faster, more easily managed parts. These smaller parts are called data shards. The word shard means “a small part of a whole.”

Key-Based Sharding :

Key-based sharding is one of the most simplest way of sharding a database with its own limitations which we will discuss later in this article.

The assumption / pre-requisite for key-based sharding :

  1. The number of database servers will be constant, i.e. shards will never grow in numbers.
  2. There will be no downtime for any database server. All servers will be up and running at any given time.

In this approach, each data record is assigned to a shard (or partition) based on a predetermined key value, such as a unique identifier, a range of values, or a hash of some attribute.

The key used for sharding can vary depending on the requirements of the application and the distribution of data. Commonly, the sharding key is chosen to evenly distribute the workload across shards, prevent hotspots (where a single shard receives disproportionately high traffic), and optimize query performance by localizing related data within the same shard.

Key-Based-Sharding

To calculate which shard an Id belong I am using the following hash function : f(x) = x % ( no. of shards + 1 ), where x is the ID / Primary key of DB table.

Let’s say we need 3 shards then, x%4 will provide us the designated shard for that ID,

1,4 ==> Shard 1

2,5 ==> Shard 2

3,6 ==> Shard 3

Implementation :

I am going to implement key-based sharding in the simplest way using java. I will be using multiple threads to simulate real-world scenario of hitting the database concurrently.

  1. Create a class that represents a shard. We will use its object to create independent shards.
package com.example.shardingimplementation.repository;


import java.util.HashMap;
import java.util.Map;



public class KeyValueRepository {

private Map<Integer,String> database = new HashMap<>();

public void insertData(Integer id,String name){
database.put(id,name);
}

public Integer getDatabaseSize(){
if(database == null)
return 0;
return database.size();
}
}

2. Implement Key-Based Sharding algorithm as below :

package com.example.shardingimplementation.algorithm;

import com.example.shardingimplementation.repository.KeyValueRepository;

import java.util.List;

public class KeyBasedSharding {
private static volatile Integer id = 0;
private Integer totalShards = 0;
private List<KeyValueRepository> listOfShards;

public KeyBasedSharding(List<KeyValueRepository> listOfShards) {
this.listOfShards = listOfShards;
this.totalShards = listOfShards.size();
}

public synchronized void insertData(String name){
id++;
Integer shardId = id % totalShards;
listOfShards.get(shardId).insertData(id,name);
}

}

3. Implement a client class to create shards and insert data in the shards concurrently with below code :

KeyValueRepository shard1 = new KeyValueRepository();
KeyValueRepository shard2 = new KeyValueRepository();
KeyValueRepository shard3 = new KeyValueRepository();
KeyValueRepository shard4 = new KeyValueRepository();

List<KeyValueRepository> listOfShards = List.of(shard1,shard2,shard3,shard4);

KeyBasedSharding keyBasedSharding = new KeyBasedSharding(listOfShards);

ExecutorService fixedExecutorService = Executors.newFixedThreadPool(5);


for(int i=0; i<100000;i++){
fixedExecutorService.submit(() -> {
keyBasedSharding.insertData("dummyDatabaseValue");
});

}

fixedExecutorService.shutdown();

Thread.sleep(1000);

for(KeyValueRepository shards : listOfShards){
System.out.println(shards.getDatabaseSize());
}

Result: I tried inserting 1 Lakh data concurrently and as expected all the shards got an equal number of data.

Limitations of Key-Based Sharding :

Key-based sharding, while enhancing scalability and performance in distributed databases, faces limitations. Uneven data distribution and hotspots can occur due to skewed or popular data. Complex queries involving multiple shards pose challenges, and ensuring data consistency across shards is complex. Managing shard distribution, rebalancing, and failures adds operational overhead. Determining shard sizes can be difficult, impacting resource utilization. Additionally, a limited key space may hinder scalability. Addressing these constraints demands careful design, monitoring, and optimization, alongside implementing mechanisms for data distribution, consistency, and fault tolerance.

--

--

GANESH SHAH

Passionate Java developer with demonstrated expertise in creating robust Java APIs solving business use cases.