How does storage work in distributed systems?

PB
SystemDesign.us Blog
5 min read · Aug 24, 2022
https://docs.bmc.com/docs/discovery/contentref/files/997880678/997880682/1/1561987376212/DSM_Concept.png

Visit systemdesign.us for System Design Interview Questions tagged by companies and their Solutions. Follow us on YouTube, LinkedIn, Twitter, Medium, Quora.

In a distributed storage system, data is spread across multiple physical servers. This supports high availability, backup, and disaster recovery. The data is typically stored on commodity hardware, which is less expensive than enterprise storage solutions.

Each node in the cluster has its own storage media, such as hard drives or SSDs. Data is replicated across nodes to protect against hardware failure. A replication factor of 3 is a common choice (it is the default in HDFS, for example), which means that each piece of data is stored on three different nodes and stays available if one of them fails. When a file is created or updated on one server, the change is replicated to the other servers that hold a copy. Depending on the system, replication is synchronous or asynchronous, so the replicas either always agree or converge on the latest version after a short delay.
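To make the replication idea concrete, here is a minimal in-memory sketch of writing a value to three nodes and reading it back after one of them fails. Everything in it is an assumption made for illustration: the `Node` class, the `NodeDown` exception, and the hash-based placement rule in `place_replicas` are hypothetical and not taken from any particular system. Real systems usually also consider racks, disk usage, and network topology when placing replicas.

```python
import hashlib


class NodeDown(Exception):
    """Raised when a node that is marked as down is asked to serve a request."""


class Node:
    """A toy in-memory storage node (hypothetical, for illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.data: dict[str, bytes] = {}

    def write(self, key: str, value: bytes) -> None:
        if not self.up:
            raise NodeDown(self.name)
        self.data[key] = value

    def read(self, key: str) -> bytes:
        if not self.up:
            raise NodeDown(self.name)
        return self.data[key]


def place_replicas(key: str, nodes: list[Node], replication_factor: int = 3) -> list[Node]:
    """Pick `replication_factor` distinct nodes for a key.

    Assumed placement rule: hash the key to choose a starting node,
    then take the following nodes in list order, wrapping around.
    """
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def replicated_write(key: str, value: bytes, nodes: list[Node],
                     replication_factor: int = 3) -> list[Node]:
    """Write the value to every chosen replica and return the replicas used."""
    replicas = place_replicas(key, nodes, replication_factor)
    for node in replicas:
        node.write(key, value)
    return replicas


def read_with_failover(key: str, replicas: list[Node]) -> bytes:
    """Return the value from the first replica that is still up."""
    for node in replicas:
        try:
            return node.read(key)
        except NodeDown:
            continue  # try the next copy
    raise NodeDown("all replicas are unavailable")


cluster = [Node(f"node-{i}") for i in range(5)]
replicas = replicated_write("report.csv", b"a,b,c\n", cluster)
replicas[0].up = False  # simulate one failed node
recovered = read_with_failover("report.csv", replicas)
```

In this sketch the write is synchronous: it returns only after all three copies are stored, which keeps the replicas identical at the cost of waiting for every node.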

Distributed storage systems power massively scalable storage services such as Amazon S3, as well as large data pools in on-premises data centers.

What kind of data can be stored in distributed fashion?

Data in a distributed system is typically stored in one of three forms: files, blocks, or objects.

Files: A file-based system stores data as a hierarchy of files and folders. This model is easy to use and familiar to most users. In a distributed file system, the files are spread across multiple physical servers, which allows for high availability and scalability. A well-known example of a distributed file system is the Hadoop Distributed File System (HDFS).

Blocks: Block-based storage systems divide a volume into fixed-size chunks known as blocks, each addressed by its position rather than by a file path. Block storage sits below the file system layer and can offer higher performance for workloads such as databases. A block is the smallest unit the storage system reads or writes. A common way to provide shared block storage is a Storage Area Network (SAN).

Objects: An object-based storage system wraps data, together with its metadata, into objects, each identified by a unique ID or hash. This is a newer type of storage system that is becoming increasingly popular due to its scalability and flexibility. As with files, the objects are spread across multiple physical servers for high availability and scalability. A popular example of an object storage system is Amazon S3.
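The difference between the block and object models is easier to see side by side. The sketch below is only an illustration under stated assumptions: `split_into_blocks` and `ObjectStore` are hypothetical names, the 4096-byte block size is an arbitrary choice, and using the SHA-256 hash of the content as the object ID is just one of the two options mentioned above (many object stores use caller-chosen keys instead).

```python
import hashlib
from typing import Optional


def split_into_blocks(data: bytes, block_size: int = 4096) -> list[bytes]:
    """Split a byte string into fixed-size blocks addressed by index.

    In this sketch the last block may be shorter than `block_size`.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ObjectStore:
    """A toy in-memory object store keyed by the SHA-256 hash of the content."""

    def __init__(self) -> None:
        self._objects: dict[str, tuple[bytes, dict[str, str]]] = {}

    def put(self, data: bytes, metadata: Optional[dict[str, str]] = None) -> str:
        """Store the data with its metadata and return the object ID."""
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = (data, dict(metadata or {}))
        return object_id

    def get(self, object_id: str) -> tuple[bytes, dict[str, str]]:
        """Return the (data, metadata) pair for an object ID."""
        return self._objects[object_id]


payload = bytes(10_000)  # 10,000 zero bytes as sample data

# Block view: the payload becomes numbered chunks with no metadata attached.
blocks = split_into_blocks(payload)
block_sizes = [len(block) for block in blocks]  # [4096, 4096, 1808]

# Object view: the payload is stored whole, with metadata, under a single ID.
store = ObjectStore()
object_id = store.put(payload, {"content-type": "application/octet-stream"})
data, metadata = store.get(object_id)
```

The block view knows nothing about what the bytes mean, which is why a file system or database normally sits on top of it. The object view keeps the data and its description together behind one identifier.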

What are the advantages of using a distributed storage system?

High Availability: Distributed storage systems are designed for high availability. Because the data is replicated across multiple nodes, it is still available from the other nodes if one node in the system goes down.

Scalability: Distributed storage systems are highly scalable. Because the data is spread across multiple physical servers, the system can grow to accommodate more data by adding servers.

Flexibility: Distributed storage systems are very flexible. They can be used to store a variety of data types, such as files, blocks, and objects. This makes them a good choice for many different applications.

Cost-Effective: Distributed storage systems are cost-effective. This is because they use commodity hardware, which is less expensive than enterprise storage solutions.
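To make the scalability point above concrete, here is one technique that some distributed stores use to decide which node owns which key: consistent hashing. The article does not name this technique, so treat it as an assumption chosen for illustration. `HashRing`, `stable_hash`, and the choice of 100 virtual nodes per server are hypothetical. The property the sketch demonstrates is that adding a server moves only part of the keys, and every moved key lands on the new server.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Map a string to a 64-bit integer that is stable across runs."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """A minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: list[tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `vnodes` points for this node on the ring."""
        for i in range(self._vnodes):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key's hash."""
        index = bisect.bisect_left(self._ring, (stable_hash(key), ""))
        return self._ring[index % len(self._ring)][1]  # wrap around the ring


keys = [f"object-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.node_for(key) for key in keys}

ring.add_node("node-d")  # scale out by one server
after = {key: ring.node_for(key) for key in keys}

# Keys whose owner changed; with consistent hashing they all move to node-d.
moved = [key for key in keys if before[key] != after[key]]
```

With a naive `hash(key) % number_of_nodes` scheme, going from three to four servers would reassign most keys. With the ring, the expected share that moves is about one quarter in this setup, although the exact count depends on the hash values.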

What are the disadvantages of using a distributed storage system?

Complexity: Distributed storage systems can be complex to set up and manage. Because the data is spread across multiple physical servers, it is harder to keep track of where each piece of data lives. In addition, when a node goes down, the system has to detect the failure and re-replicate the lost copies from the remaining nodes, which adds operational work.

Performance: Distributed storage systems can have higher latency than local storage, because reads and writes have to cross the network between servers. In addition, each write has to be replicated to other nodes, which adds latency whenever the system waits for those replicas to acknowledge.

Security: Distributed storage systems can be harder to secure than a single server, because the data sits on many machines and moves between them over the network. If one node is compromised, the replicas it holds are exposed, and the attacker may be able to reach other nodes unless traffic between nodes is authenticated and encrypted.
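The replication latency mentioned under Performance can be estimated with simple arithmetic. The sketch below assumes the replicas are written in parallel and that the write completes once a chosen number of them have acknowledged. The function name `write_latency_ms`, the `acks_required` parameter, and the sample latencies are all made up for illustration.

```python
def write_latency_ms(replica_latencies_ms: list[float], acks_required: int) -> float:
    """Latency of a parallel replicated write that waits for `acks_required` replicas.

    The write finishes when the k-th fastest replica has acknowledged,
    so the result is the k-th smallest latency.
    """
    if not 1 <= acks_required <= len(replica_latencies_ms):
        raise ValueError("acks_required must be between 1 and the number of replicas")
    return sorted(replica_latencies_ms)[acks_required - 1]


# Hypothetical round-trip times to three replicas, one of them slow.
latencies = [4.0, 7.0, 35.0]

wait_for_all = write_latency_ms(latencies, acks_required=3)       # 35.0
wait_for_majority = write_latency_ms(latencies, acks_required=2)  # 7.0
wait_for_one = write_latency_ms(latencies, acks_required=1)       # 4.0
```

Waiting for every replica means the slowest node sets the pace. Waiting for fewer acknowledgements is faster, but a read that arrives before the remaining replicas catch up may see an older version.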

How do I choose a distributed storage system?

When choosing a distributed storage system, there are several factors to consider:

What type and volume of data do you need to store? If you need to store large amounts of data, then you will need a system that scales by adding servers. If you need to store a variety of data types, then you will need a system that supports more than one of the file, block, and object models.

What are your availability requirements? If you need high availability, look at the replication factor the system supports and how it behaves when a node fails.

What are your performance requirements? If you need high performance, check how much latency replication and network access add for your workload.

What are your security requirements? If you need high security, check how the system controls access to each node and whether it encrypts data at rest and in transit.

What is your budget? Running many servers can cost more to set up and operate than a single storage system. However, because distributed systems run on commodity hardware rather than enterprise storage, they are often more cost-effective in the long run.
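When working through the budget question above, keep in mind that replication reduces usable capacity: with a replication factor of 3, every terabyte of data consumes three terabytes of raw disk. The sketch below does that arithmetic. The function names, the node count, the disk sizes, and the price per node are hypothetical numbers chosen only to show the calculation.

```python
def usable_capacity_tb(node_count: int, raw_tb_per_node: float,
                       replication_factor: int = 3) -> float:
    """Usable capacity when every piece of data is stored `replication_factor` times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return node_count * raw_tb_per_node / replication_factor


def cost_per_usable_tb(node_count: int, cost_per_node: float,
                       raw_tb_per_node: float, replication_factor: int = 3) -> float:
    """Hardware cost divided by usable (not raw) capacity."""
    total_cost = node_count * cost_per_node
    return total_cost / usable_capacity_tb(node_count, raw_tb_per_node, replication_factor)


# Hypothetical cluster: 12 commodity nodes, 8 TB each, 2000 currency units per node.
usable = usable_capacity_tb(12, 8.0)        # 96 TB raw / 3 = 32.0 TB usable
cost = cost_per_usable_tb(12, 2000.0, 8.0)  # 24000 / 32 = 750.0 per usable TB
```

Comparing the cost per usable terabyte, rather than per raw terabyte, gives a fairer picture when weighing a replicated cluster against another storage option.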

What are some examples of distributed storage systems?

Hadoop Distributed File System (HDFS): HDFS is a distributed file system that is designed for storing large amounts of data. It is part of the Apache Hadoop project.

Google File System (GFS): GFS is a proprietary distributed file system developed by Google for storing large amounts of data on commodity hardware. Google has since replaced it with a successor system, Colossus.

Microsoft Azure Storage: Azure Storage is a cloud storage service from Microsoft. It covers several storage types, including blobs (objects), file shares, queues, and tables.

Amazon S3: Amazon S3 is a cloud object storage service from Amazon Web Services. Data is stored as objects in buckets, and each object is identified by a key.

What are some examples of distributed databases?

Cassandra: Apache Cassandra is a distributed database that is designed for storing large amounts of data across many nodes. It was originally developed at Facebook and is now an Apache project.

HBase: Apache HBase is a distributed database that runs on top of HDFS in the Hadoop ecosystem. It is designed for storing large amounts of data.

MongoDB: MongoDB is a document-oriented database that is used by many organizations. It stores JSON-like documents and can distribute data across servers through replication and sharding.

CouchDB: Apache CouchDB is a document-oriented database maintained by the Apache Software Foundation. It stores JSON documents and can replicate them between nodes.

Conclusion

Distributed storage systems store data across multiple physical servers. They are cost-effective and scalable, but can be complex to set up and manage. When choosing a distributed storage system, you need to consider your requirements for performance, availability, and security.
