Overview of data storage implications for distributed and big data computing

How to deal with the complexity of storing data for distributed applications

Published in Hacking Analytics · 8 min read · Dec 30, 2019

Storing data for distributed systems can be a complex affair: there are multiple approaches, and many nuts and bolts need tweaking to get data storage just right for a specific application. The choice of storage approach affects performance, storage cost, redundancy, and engineering complexity, among other things, and these decisions need to be made in the context of the application, its dataset, and its usage patterns.

Storage Redundancy

There are two main ways to deal with storage redundancy: data replication and what is called “erasure coding”. Each method has its own advantages and disadvantages, and each is better suited to different needs.
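To make the trade-off concrete, here is a minimal sketch comparing the raw storage each approach consumes per byte of user data. The function names are made up for illustration, and the erasure coding figures assume a scheme that splits data into k fragments plus m parity fragments and survives the loss of any m of them (as Reed-Solomon codes do).

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data when keeping `copies` full copies."""
    if copies < 1:
        raise ValueError("at least one copy is required")
    return float(copies)


def erasure_coding_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per byte of user data for a (k data + m parity) scheme.

    Assumption: all fragments have the same size.
    """
    if data_fragments < 1 or parity_fragments < 0:
        raise ValueError("need k >= 1 data fragments and m >= 0 parity fragments")
    return (data_fragments + parity_fragments) / data_fragments


# Three full copies tolerate the loss of two copies at 3x the raw storage.
three_copies = replication_overhead(3)

# Six data + three parity fragments tolerate the loss of any three fragments
# (under the assumption stated above) at 1.5x the raw storage.
six_plus_three = erasure_coding_overhead(6, 3)

print(f"replication x3: {three_copies:.2f}x, erasure coding 6+3: {six_plus_three:.2f}x")
```

The numbers only capture storage cost; they say nothing about the extra computation and network traffic that rebuilding lost fragments can require.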

Replication

The simplest approach to storage redundancy is data replication, where we store an exact copy of the data in case anything happens to the original. In computer storage, such as desktops and servers, this has usually been implemented through RAID 1. RAID stands for Redundant Array of Inexpensive Disks.
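The mirroring idea can be sketched in a few lines. This is a toy in-memory model, not how a RAID controller is implemented: the `MirroredStore` class and its methods are hypothetical names, and each "disk" is just a Python dictionary.

```python
from typing import Dict, List, Optional


class MirroredStore:
    """Toy RAID 1-style mirror: every write goes to all healthy replicas."""

    def __init__(self, replica_count: int = 2) -> None:
        if replica_count < 2:
            raise ValueError("mirroring needs at least two replicas")
        # Each replica is a dict standing in for a disk; None marks a failed disk.
        self._replicas: List[Optional[Dict[str, bytes]]] = [
            {} for _ in range(replica_count)
        ]

    def write(self, key: str, value: bytes) -> None:
        """Store an exact copy of the value on every healthy replica."""
        for replica in self._replicas:
            if replica is not None:
                replica[key] = value

    def fail(self, index: int) -> None:
        """Simulate losing one disk and everything stored on it."""
        self._replicas[index] = None

    def read(self, key: str) -> bytes:
        """Return the value from the first healthy replica that holds it."""
        for replica in self._replicas:
            if replica is not None and key in replica:
                return replica[key]
        raise KeyError(key)


store = MirroredStore(replica_count=2)
store.write("report.csv", b"id,value\n1,42\n")
store.fail(0)  # the original disk is lost
recovered = store.read("report.csv")  # served from the surviving copy
```

The cost of this simplicity is capacity: with two replicas, half of the raw storage holds copies rather than new data.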


Written by Julien Kervizic