The Journey to Distributed Systems: Part 3 — Distributed Storage

Williams Adu · The Andela Way · Aug 8, 2019

Ah! Anytime we think of the old single-master storage solution, it's hard to beat its simplicity. It's very simple and it works: a database that just sits on a single computer. But much like people, when you push its sliders in certain directions (more reads, more writes, more data), the database gets dizzier over time. Let's explore some of those scenarios.

Typically, web applications see far more reads than writes, and in certain environments the read traffic can be huge and scary. Let's constrain ourselves to a relational database. As reads grow, the single-master database would run out of gas at some point, and read operations would get slower over time. The usual solution we deploy to deal with this is read replication: a single master and, most often, two or more slaves. The idea is to push the read requests to the slaves and have the master handle the write requests.

It looks like a good way to solve the problem. Wait! It's too early to celebrate. Notice that we have broken the centralized database system we started with: we have deployed a distributed database. Ah, and we may have broken consistency as well. In our centralized architecture, there was always a guarantee that after a write, an immediate read would return the new data. It's different in our present state: a write to the master is not immediately propagated to the slaves, so a read that lands on a lagging slave can return stale data. What we end up with is an eventually consistent database. We can also start thinking of more problems, such as what happens if one of the slaves, or the master itself, goes down, or there is a network partition. As stated in earlier articles of this series, computers cannot be trusted, and neither can their network connections.
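To make the moving parts concrete, here is a minimal, runnable sketch of read replication using toy in-memory nodes in place of real database servers. The names and the simulated replication lag are illustrative assumptions, not any particular database's behaviour:

```python
import random

class Node:
    """A toy storage node: just an in-memory key-value store."""
    def __init__(self):
        self.data = {}

class ReplicatedDatabase:
    """Single master, N slaves. Writes go to the master and are
    replicated with a (simulated) delay; reads are spread across slaves."""
    def __init__(self, slave_count=2):
        self.master = Node()
        self.slaves = [Node() for _ in range(slave_count)]

    def write(self, key, value):
        self.master.data[key] = value
        # Real replication is asynchronous; we simulate lag by letting
        # each slave miss the update some of the time.
        for slave in self.slaves:
            if random.random() < 0.5:
                slave.data[key] = value

    def read(self, key):
        # The client reads from a random slave, which may be stale.
        return random.choice(self.slaves).data.get(key)

db = ReplicatedDatabase()
db.write("user:1", "Ama")
print(db.read("user:1"))  # may print 'Ama' or None: eventual consistency
```

Running the last two lines a few times shows the read-after-write guarantee we lost: the answer depends on which slave the read happens to hit.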

Hmm! What if we move our slider towards more write requests instead of reads? Usually, that problem is resolved through sharding: the technique of breaking a database, or its tables, into smaller sets based on a certain key. This comes with implications of its own. One of them is a change to our data model. Another is that performing joins across different shards is difficult, since related rows may now live on different machines. And if each shard is itself a master with slaves, we may break consistency as well, just as described above.
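As a rough sketch, shard routing often boils down to hashing the shard key and taking it modulo the number of shards. The shard names below are made up, and real systems also have to handle resharding, which this naive approach does badly:

```python
import hashlib

SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]  # illustrative names

def shard_for(key: str) -> str:
    """Pick a shard by hashing the shard key. Every query for this key
    must be routed to the same shard; a join across two keys that land
    on different shards now spans two machines."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))  # always the same shard for this key
print(shard_for("user:43"))  # may land on a different one
```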

At this point, you may have realized that certain guarantees we started off with under a single-master database have been lost as we moved to a distributed architecture. Also, don't forget that our data model may change as well. To explain this further, consider a scenario where reads are the bottleneck. What do you reach for first when you intend to fix read bottlenecks in your database? Most people think of indexing. True, that works. But what if you still experience bottlenecks even after indexing? Likely, the bottleneck is a result of joining data within your database. So you end up denormalizing in order to improve the efficiency of your reads. Essentially, you end up changing your data model.
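Here is a toy illustration of that trade, using plain Python dictionaries to stand in for tables (the schema is invented for the example):

```python
# Normalized: reading an order requires a join across two "tables".
users  = {1: {"name": "Ama"}}
orders = {100: {"user_id": 1, "total": 25.0}}

def order_view_normalized(order_id):
    order = orders[order_id]
    user = users[order["user_id"]]          # the join
    return {"user": user["name"], "total": order["total"]}

# Denormalized: the user's name is copied into the order, so a read
# touches a single record. The cost: updating a name now means
# updating every order that embeds it.
orders_denormalized = {100: {"user_name": "Ama", "total": 25.0}}

def order_view_denormalized(order_id):
    return orders_denormalized[order_id]    # no join needed
```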

CAP Theorem

It would be lovely if we could discuss distributed databases using the CAP theorem. Consistency means that read operations return the most recent data that has been written. Availability means that both read and write operations keep working. And since we can't trust computers, and nodes can simply drop off the network, partition tolerance means the system keeps operating even when the network splits it into groups of nodes that cannot reach each other.

We both know it: per the CAP theorem, we can't ensure all three guarantees at once; you can pick only two of the three. And since network partitions are a fact of life, the real choice during a partition is between consistency and availability. Understanding the trade-offs is vital in deciding what architecture to go with or which database technology to use.
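A tiny sketch of what that choice looks like from a single replica's point of view. The `mode` flag is purely illustrative, not a real database setting:

```python
class Replica:
    """A toy replica showing the choice CAP forces during a partition."""
    def __init__(self, mode="CP"):
        self.mode = mode
        self.value = "old"
        self.partitioned = False   # True when cut off from the master

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Choose consistency: refuse to answer rather than risk
            # returning stale data (we give up availability).
            raise RuntimeError("unavailable during partition")
        # Choose availability: always answer, possibly with stale data
        # (we give up consistency).
        return self.value

replica = Replica(mode="AP")
replica.partitioned = True
print(replica.read())   # 'old': available, but possibly stale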

Summary

Distributed systems are awesome and solve very challenging problems. But note that this comes with a cost: as the examples above show, you end up losing certain guarantees. So what next? You might want to take a closer look at what NoSQL databases such as MongoDB, Cassandra, Neo4j, and Redis offer, and how they behave under a distributed architecture. For more advanced experiments, you might want to consider consistent hashing.
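To give a taste of that last idea, here is a bare-bones consistent hash ring. It omits the virtual nodes that real implementations add to balance load, and the node names are invented:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Nodes are placed on a ring by hash. Adding or removing a node
    only remaps the keys between it and its neighbour, instead of
    reshuffling every key the way hash-modulo-N sharding does."""
    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = _hash(key)
        hashes = [node_hash for node_hash, _ in self.ring]
        # First node clockwise from the key's position, wrapping around.
        idx = bisect.bisect(hashes, h) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node_a", "node_b", "node_c"])
print(ring.node_for("user:42"))  # stable as long as the ring is unchanged
```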

Part 1 — Introduction

Part 2 — CAP theorem

Part 3 — Distributed Storage

Part 4 — Distributed Computing
