Distributed Databases what, why, why not?

Nader Medhat
Nerd For Tech
Published in
7 min readJan 14, 2021

What is a distributed database?

Distributed database system is one in which the data belonging to a single logical database is distributed to two or more physical databases to inscure reliability and availability

you could distribute your data across multi nodes using many different system architectures

Shard Memory

CPUs have access to common memory address space via a fast interconnect.

  • Each processor has a global view of all the in-memory data structures.
  • Each DBMS instance on a processor has to “know” about the other instances.

Shard Disc

All CPUs can access a single logical disk directly via an interconnect, but each has its own private memories.

  • Can scale execution layer independently from the storage layer.
  • Must send messages between CPUs to learn about their current state.

Shard Nothing

Each DBMS instance has its own CPU, memory, and disk. Nodes only communicate with each other via a network. → Hard to increase capacity. → Hard to ensure consistency. → Better performance & efficiency.

Types of distributed databases

Replication

Database replication can be used to provide fault tolerance that has a database back end. Database replication is relatively inexpensive to implement. You don’t need specialized hardware or software. You just need to make sure that your database platform supports it. You also need to make sure that the application stores data in a way that is conducive to replication. In database replication, data is automatically copied from one database to another. In database replication, there is generally a publisher and a subscriber. The publisher is the database that holds the main copy of the data. The subscriber gets information by copying data from the publisher.

Sharding

Sharding is the process of breaking up large tables into smaller chunks called shards that are spread across multiple servers. A shard is essentially a horizontal data partition that contains a subset of the total data set and hence is responsible for serving a portion of the overall workload. The idea is to distribute data that can’t fit on a single node onto a cluster of database nodes. Sharding is also referred to as horizontal partitioning. The distinction between horizontal and vertical comes from the traditional tabular view of a database. A database can be split vertically — storing different table columns in a separate database, or horizontally — storing rows of the same table in multiple database nodes

Why using a distributed database

Increased Reliability and availability
Reliability is basically defined as the probability that a system is running at a certain time whereas Availability is defined as the probability that the system is continuously available during a time interval. When the data and DBMS software are distributed over several sites one site may fail while other sites continue to operate and we are not able to only access the data that exist at the failed site and this basically leads to improvement in reliability and availability

Better Response

If data is distributed in an efficient manner, then user requests can be met from local data itself, thus providing a faster response. On the other hand, in centralized systems, all queries have to pass through the central computer for processing, which increases the response time.

Modular Development

If the system needs to be expanded to new locations or new units, in centralized database systems, the action requires substantial efforts and disruption in the existing functioning. However, in distributed databases, the work simply requires adding new computers and local data to the new site and finally connecting them to the distributed system, with no interruption in current functions.

Less Data Movement over Network

The more replicas of, a relation are there, the greater are the chances that the required data is found where the transaction is executing. Hence, data replication reduces the movement of data among sites and. increases .speed of processing.

Smaller Databases are Easier to Manage

Production databases must be fully managed for regular backups, database optimization, and other common tasks. With a single large database, these routine tasks can be very difficult to accomplish, if only in terms of the time window required for completion. Routine table and index optimizations can stretch from hours to days, in some cases making regular maintenance infeasible. By using the sharding approach, each individual “shard” can be maintained independently, providing a far more manageable scenario, performing such maintenance tasks in parallel.

Smaller Databases are Faster

The scalability of sharding is apparent and achieved through the distribution of processing across multiple shards and servers in the network. What is less apparent is the fact that each individual shard database will outperform a single large database due to its smaller size. By hosting each shard database on its own server, the ratio between memory and data on disk is properly balanced, thereby reducing disk I/O and maximizing system resources. This results in less contention, greater join performance, faster index searches, and fewer database locks. Therefore, not only can a sharded system scale to new levels of capacity, individual transaction performance is benefited as well.

Database Sharding can Reduce Costs

Most database sharding implementations take advantage of low-cost open-source databases and commodity databases. The technique can also take full advantage of reasonably priced “workgroup” versions of many commercial databases. Sharding works well with commodity multi-core server hardware, systems that are far less expensive when compared to high-end, multi-CPU servers, and expensive storage area networks (SANs). The overall reduction in cost due to savings in license fees, software maintenance, and hardware investment is substantial in some cases 70% when compared to traditional solutions.

Why not using a distributed database

Distributed Database Not a Silver Bullet, There’s no single technology that can be the elixir to all your problems. The database realm is no different. If your data can fit on a single MySQL instance without too much pressure on your server, or if your performance requirement for complex queries isn’t high, then a distributed database may not be a good choice because :

Need for complex and expensive software

DDBMS demands complex and often expensive software to provide data transparency and coordination across several sites.

Processing overhead

Even simple operations may require a large number of communications and additional calculations to provide uniformity in data across the sites.

Data integrity

The need for updating data in multiple sites pose problems of data integrity.

Overheads for improper data distribution

Responsiveness of queries is largely dependent upon proper data distribution. Improper data distribution often leads to a very slow response to user requests.

Handling failures is a difficult task

In some cases, we may not distinguish between site failure, network partition, and link failure.

May cause much more network traffic

in case of a write operation in a replicated form of distributed database

Examples of distributed databases

Though there are many distributed databases to choose from, some examples of distributed databases include Apache Cassandra, Apache HBase, Couchbase Server, Amazon SimpleDB, and FoundationDB.

Apache Cassandra

offers support for clusters that span multiple locations, and it features its own query language, Cassandra Query Language (CQL). Additionally, Cassandra’s replication strategies are configurable.

Apache HBase

runs on top of the Hadoop Distributed File System and provides a fault-tolerant way to store large quantities of sparse data. It also features compression, in-memory operation, and Bloom filters on a per-column basis. HBase is not intended as a replacement for SQL database, although Apache Phoenix provides a SQL layer for HBase.

Couchbase Server

is a NoSQL software package that is ideal for interactive applications that serve multiple concurrent users by creating, storing, retrieving, aggregating, manipulating, and presenting data. To support these many application needs, Couchbase Server provides scalable key-value and JSON document access.

Amazon SimpleDB

is used as a web service with Amazon Elastic Compute Cloud and Amazon S3. Amazon SimpleDB enables developers to request and store data with minimal database management and administrative responsibility.

FoundationDB

is a multimodel database designed around a core database that exposes an ordered key valued store with each transaction. These transactions support ACID properties and are capable of reading and writing keys that are stored on any machine within the cluster. Additional features appear in layers around this core.

Conclusion

at the end of this article, those are some resources for reading and getting more knowledge and I will post more post related to this topic in the future

Distributed Database Systems (springer)

Introduction to Distributed Databases (CMU Database Group)

Understanding Database Sharding

Database Replication

--

--