System Design Basics - Designing a Distributed Storage Service in the Cloud

Sidharth Purohit
8 min read · Feb 6, 2022

Ever wondered how Amazon S3 operates? How do our files get stored and scaled automatically as per our requirements? Let's try to design a replica of a distributed storage service like AWS S3, Azure Blob Storage, or Google Cloud Storage. I will begin by exploring how a real system evolves: scaling as we get more users and countering the issues we face along the way. First, let's cover the very basics.

What is System Design?

The process of defining the components, modules, interfaces, and data for a system in order to meet specific criteria is known as system design. The process of building or altering systems, as well as the procedures, techniques, models, and methodologies required to do so, is referred to as system development.

What is a Distributed Storage System in a cloud?

A distributed storage system is a type of architecture that allows data to be shared among numerous physical servers and, in some cases, multiple data centers. It usually takes the form of a storage cluster with a data synchronization and coordination mechanism amongst cluster nodes.

We are here to design something similar to massively scalable cloud storage systems like Amazon S3 and Microsoft Azure Blob Storage. When we upload files to these systems, they scale automatically; that is why they are called scalable or elastic systems. They work on a pay-as-you-use model: we pay for the services, memory, or storage according to our usage.

In this article we aim to design such a system. The focus is on designing a real-world system, and the article serves as a good resource for system design interview preparation.

What are the primary requirements of our system?

1. Durability — Our system should be highly durable; once uploaded, a file should never be lost.

2. Availability — The system should be available to the end user without any delay.

3. Multi-tenancy — The system should support multiple customers uploading different objects without any difficulty, and it should not be deployed separately for each and every customer. Each customer should be able to upload from their own console and see only their own files. Ideally, one system should serve all customers.

4. Scalability — The system should be scalable; once on board, the customer should not have to worry about the capacity of the system, no matter how many files they keep uploading.

5. Security — We should ensure that the APIs work over HTTPS (using something like TLS/SSL) and that content is stored securely in our storage system.

Let’s Begin!

High Level System Design

Below, we move from a basic design to a more complex and efficient system. Feel free to skip to the last section if you are comfortable with basic design problems and their countermeasures.

1. Basic Design #1: A very basic and naive approach is a single server with some APIs exposed; let's assume it has about 50 TB of capacity. The server has an API part and a data storage part, and the strategy is simple: a user creates a bucket and starts uploading all of their files into that bucket.

THE PROBLEM here is — this system does not support parallel read and write operations, the APIs can handle only a certain amount of traffic, and there is always a limit on the number of input/output operations a single machine can perform. So this system is not SCALABLE.

A basic system with one user and one server — with exposed APIs and Data Storage
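To make the single-server idea concrete, here is a minimal Python sketch of one machine exposing bucket and upload operations backed by its own local disk. The storage path and function names are illustrative assumptions, not part of any real service; the point is that the API handling and the bytes both live on the same box.

```python
import os

STORAGE_ROOT = "/data"  # the single server's local disk (illustrative path)

def create_bucket(bucket: str) -> None:
    """Create a bucket as a directory on this server's disk."""
    os.makedirs(os.path.join(STORAGE_ROOT, bucket), exist_ok=True)

def put_object(bucket: str, key: str, data: bytes) -> None:
    """Store an object; the API and the stored bytes share one machine."""
    with open(os.path.join(STORAGE_ROOT, bucket, key), "wb") as f:
        f.write(data)

def get_object(bucket: str, key: str) -> bytes:
    """Read an object back from the same local disk."""
    with open(os.path.join(STORAGE_ROOT, bucket, key), "rb") as f:
        return f.read()
```

Every call here contends for the same disk and network interface, which is exactly the I/O ceiling described above.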

2. Basic Design #2:

To make the system scalable, we scale it horizontally; say we add a second server, doubling the traffic-handling capacity. A Load Balancer on the client-facing side distributes incoming requests between the servers.

THE PROBLEM here is — whenever the user tries to fetch a file that is not on the server the request lands on, the system responds that the file is not available. Thus this design is not RELIABLE.

A slightly more scalable high-level design that handles queries through a Load Balancer (LB)

3. Basic Design #3: We can solve the reliability problem above by separating the API part from the data storage part. The API servers now act as commodity machines (mostly CPU with some hard disk). We add a metadata service that redirects the user to the correct server on which the file is present.

THE PROBLEM (and RESOLUTION) here is — this system fails whenever the metadata server fails, so the metadata service itself needs to be scaled. And if the data storage part fails, we need replication for it too. This scaled metadata service handles the whole allocation and replication process: it sends heartbeat signals to the data storage servers to monitor their health, and decides when and where to create replicas. We create replicas in different regions, so that even if all the servers in one region go down, the user can still get their data back.

A relatively scalable and reliable high-level design
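As a rough sketch of what the metadata service's heartbeat and replica-placement bookkeeping could look like, here is a small Python example. The server names, region names, and timeout value are hypothetical; the only idea taken from the design above is "track health via heartbeats and place replicas outside the primary's region."

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a heartbeat before a server counts as down (arbitrary)

last_heartbeat: dict[str, float] = {}   # storage server -> last heartbeat timestamp
server_region = {"s1": "us-east", "s2": "us-east", "s3": "eu-west"}  # illustrative layout

def record_heartbeat(server: str) -> None:
    last_heartbeat[server] = time.time()

def healthy_servers() -> list[str]:
    now = time.time()
    return [s for s, t in last_heartbeat.items() if now - t < HEARTBEAT_TIMEOUT]

def pick_replica_targets(primary: str, count: int = 2) -> list[str]:
    """Prefer healthy servers in regions other than the primary's,
    so a whole-region outage cannot take out every copy."""
    home = server_region[primary]
    candidates = [s for s in healthy_servers()
                  if s != primary and server_region[s] != home]
    return candidates[:count]

record_heartbeat("s1")
record_heartbeat("s3")
print(pick_replica_targets("s1"))   # -> ['s3'] (healthy and out of the primary's region)
```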

Low Level System Design

What should the system ideally be doing?

The user comes with a file and the subdomain of the bucket. The request goes to the DNS server, and the traffic gets redirected to the load balancer. The load balancer then decides, based on its routing strategy, which specific API server should receive the request; that strategy could be based on dropping unhealthy nodes or could be completely random.

The API server then validates whether the user with a specific API key is authorized to perform the upload operation. The diagram below is divided into two layers, a partitioning layer and a streaming layer, each with its own tasks.
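For illustration, the authorization step might look like the following sketch. The permission table and key names are made up; a real system would back this with a proper identity and policy service.

```python
# Hypothetical permission table: api_key -> set of buckets it may write to.
PERMISSIONS = {
    "key-abc": {"photos", "backups"},
}

def authorize_upload(api_key: str, bucket: str) -> bool:
    """Return True only if this key may upload into this bucket."""
    return bucket in PERMISSIONS.get(api_key, set())

# The API server rejects the request before touching any storage layer.
assert authorize_upload("key-abc", "photos")
assert not authorize_upload("key-abc", "someone-elses-bucket")
```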

So when the file comes to the API server, the server checks with the partition layer which partition server should handle the request. The partition layer maintains a partition map that records which files go to which partition server. The partition layer helps with parallelizing and scaling: whenever a large number of files arrive from users, the hashing mechanism in the partition layer allocates specific files to particular file servers.

When a file comes in, the API server assigns it a unique ID. The hashed file ID then gets allocated to a partition server; each partition server handles a specific range of hash values.
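A minimal sketch of this hash-to-range routing, assuming a hypothetical three-server partition map over a 32-bit key space:

```python
import hashlib

# Hypothetical partition map: each partition server owns a contiguous hash range.
PARTITION_MAP = [
    ("partition-server-1", 0, 2**32 // 3),
    ("partition-server-2", 2**32 // 3, 2 * 2**32 // 3),
    ("partition-server-3", 2 * 2**32 // 3, 2**32),
]

def file_hash(file_id: str) -> int:
    """Hash the file's unique ID into a 32-bit key space."""
    return int.from_bytes(hashlib.sha256(file_id.encode()).digest()[:4], "big")

def route_to_partition(file_id: str) -> str:
    h = file_hash(file_id)
    for server, lo, hi in PARTITION_MAP:
        if lo <= h < hi:
            return server
    raise RuntimeError("partition map does not cover the hash space")

print(route_to_partition("bucket-42/holiday.jpg"))
```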

Once a partition server receives the file, it talks directly to the streaming layer. The streaming layer contains many stream servers, and each stream server acts as a linked list of file servers; new file servers keep getting added at the head.

Streaming Layer :

This layer is responsible for receiving a file from the partition server and storing it on disk. The responsibilities of a streaming server include: appending new file servers and files to a particular stream; sealing a file server once it is full; running garbage collection on the file servers; replicating files (the stream manager stores the information about which regions other than the current one have empty locations where the current file can be replicated); and running periodic checks on the servers to verify the health of the file systems.
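Here is a small sketch of the "linked list of file servers, grow at the head, seal when full" idea. Class names and the per-server capacity are assumptions for illustration; garbage collection, replication, and health checks are omitted.

```python
class FileServer:
    """One node in a stream server's chain of file servers (illustrative)."""
    def __init__(self, name: str, capacity: int):
        self.name = name
        self.free = capacity   # bytes this file server can still accept
        self.sealed = False    # a sealed server takes no new writes
        self.next = None       # link to the previously active file server

class StreamServer:
    """Keeps its file servers in a linked list and always writes at the head."""
    DEFAULT_CAPACITY = 1 << 30   # 1 GiB per file server, an arbitrary figure

    def __init__(self):
        self.head = None
        self._count = 0

    def _add_file_server(self) -> None:
        self._count += 1
        node = FileServer(f"file-server-{self._count}", self.DEFAULT_CAPACITY)
        node.next = self.head      # new servers are added at the head
        self.head = node

    def append(self, size: int) -> str:
        """Write a blob of `size` bytes, sealing the head and growing the chain if needed."""
        if self.head is None or self.head.free < size:
            if self.head is not None:
                self.head.sealed = True
            self._add_file_server()
        self.head.free -= size
        return self.head.name
```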

Partition Layer :

Partition servers handle the mapping between files and file servers. The partition manager is responsible for assigning a partition to a specific file using a range of partition IDs. The partition manager also monitors the load on the partition servers: if any partition server goes down, the manager automatically assigns a new partition server and updates the partition map table.
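A toy sketch of that failover step, with made-up range boundaries and server names, just to show the partition map being rewritten when a server dies:

```python
# Hypothetical partition map: partition-id range -> partition server.
partition_map = {
    (0, 1000):    "partition-server-1",
    (1000, 2000): "partition-server-2",
}
spare_servers = ["partition-server-3"]

def handle_partition_server_failure(failed: str) -> None:
    """Reassign every range owned by the failed server to a spare and update the map."""
    for rng, server in list(partition_map.items()):
        if server == failed:
            replacement = spare_servers.pop(0)
            partition_map[rng] = replacement
            print(f"range {rng}: {failed} -> {replacement}")

handle_partition_server_failure("partition-server-2")
```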

Cache Layer :

If a request for a file is repeated, instead of going through the whole partition and file server path again, the file instance is served from the cache layer.
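One simple way to realize this is an LRU cache in front of the partition/streaming path; the sketch below is one such assumption, not a prescribed design.

```python
from collections import OrderedDict

class FileCache:
    """A small LRU cache sitting in front of the partition and file servers."""
    def __init__(self, max_items: int = 128):
        self._items: OrderedDict[str, bytes] = OrderedDict()
        self._max = max_items

    def get(self, file_id: str):
        if file_id in self._items:
            self._items.move_to_end(file_id)   # mark as recently used
            return self._items[file_id]        # cache hit: skip the storage path
        return None                            # cache miss: caller falls through to storage

    def put(self, file_id: str, data: bytes) -> None:
        self._items[file_id] = data
        self._items.move_to_end(file_id)
        if len(self._items) > self._max:
            self._items.popitem(last=False)    # evict the least recently used entry
```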

We use ZooKeeper as a lock manager: since there could be multiple stream managers and partition managers running, we want only one of them to act at a time. ZooKeeper keeps track of this by deciding which one of the stream managers is the master.
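For illustration, leader election could look roughly like the snippet below using the kazoo ZooKeeper client (the choice of library, the connection address, and the paths are assumptions, not something the design mandates). Every stream manager instance runs the same code, but only the one elected master executes its work loop.

```python
from kazoo.client import KazooClient

def run_as_master():
    """Only the elected stream manager executes this (placeholder work loop)."""
    print("I am the active stream manager")

zk = KazooClient(hosts="zookeeper:2181")   # hypothetical ZooKeeper address
zk.start()

# All stream manager instances contend on the same election path;
# ZooKeeper lets exactly one of them act as master at a time.
election = zk.Election("/storage/stream-manager-election", "manager-instance-1")
election.run(run_as_master)
```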

Note: If anything happens to the primary cluster, all the data stored in the partition database and the streaming database gets replicated to the secondary cluster in some other region via the cluster manager.

The idea is that one of the cluster regions handles all the writes (it can also serve read operations), while the other two specifically handle data replication and serve reads.
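In code form, that routing rule is tiny; the region names below are placeholders matching the description above.

```python
import random

# Hypothetical region roles: one primary for writes, the rest replicate and serve reads.
PRIMARY = "region-1"
SECONDARIES = ["region-2", "region-3"]

def route_request(operation: str) -> str:
    """Writes always go to the primary; reads may be served by any region."""
    if operation == "write":
        return PRIMARY
    return random.choice([PRIMARY] + SECONDARIES)

print(route_request("write"))   # always region-1
print(route_request("read"))    # any region
```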

In the image attached below, only region-1 is shown; a similar copy exists for region-2 and the other regions. The only difference is that region-2 and the others don't have the DNS server, Load Balancer, and Cluster Manager, as those are specific to region-1, the primary entry point of the application.

As mentioned above, since we are dealing with clusters in various regions across which the data and files are replicated, we need a Cluster Manager. The cluster manager is a program higher up in the hierarchy that handles the functioning of all the clusters in all the regions.

Low-level design of the complete end-to-end storage system

Finally, the Cluster Manager has the following functions:

1. User account creation and bucket allocation — Whenever an end user creates a new account with our storage application, they get redirected to the cluster manager. The cluster manager allocates a subdomain with the bucket name and adds this entry in the DNS server with the IP address of the Load Balancer (see the sketch after this list).

The Load Balancer (LB) here acts as the entry point for all requests coming into our distributed storage system.

2. Managing disasters and other hazards — Whenever region-1 fails, the load balancer directs the request traffic to the other clusters and lets the system recover.

3. Resource tracking — The cluster manager also tracks the resources utilized and exhausted inside the clusters and how much space is left; this is what lets the storage behave elastically.

4. Policies and authorization — Keeps track of all the policies and authorizations, stores them in the database, and manages the overall flow of the cluster.
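As promised in point 1, here is a small sketch of account creation and bucket allocation. The subdomain naming scheme, the load balancer IP, and the in-memory dictionaries standing in for the DNS server are all assumptions for illustration.

```python
LOAD_BALANCER_IP = "203.0.113.10"      # illustrative address for the LB entry point

dns_records: dict[str, str] = {}        # subdomain -> IP, standing in for the DNS server
buckets: dict[str, str] = {}            # customer -> bucket subdomain

def create_account_and_bucket(customer: str, bucket_name: str) -> str:
    """Allocate a per-customer subdomain and point it at the load balancer."""
    subdomain = f"{bucket_name}.{customer}.storage.example.com"  # hypothetical naming scheme
    buckets[customer] = subdomain
    dns_records[subdomain] = LOAD_BALANCER_IP   # all traffic enters through the LB
    return subdomain

print(create_account_and_bucket("alice", "photos"))
```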

If you are still here, Hope you enjoyed reading this article!

:)
