Parallel vs Distributed Database Systems

Pratham
May 6, 2023

Parallel Databases

Nearly all companies employ database systems to organize and preserve information. These technologies have made it easier to modify, retrieve, remove, add, and perform other operations on data. Various types of database systems have emerged to facilitate and expedite this process. One of the popular systems nowadays is known as “Parallel Databases”. In this article, we’ll delve into what parallel databases are and why they are essential. Keep reading to discover more!

Fig 1

What are parallel databases?

Parallel databases are a type of database system that uses multiple processors to provide fast and efficient database services. These systems are designed to increase performance by carrying out several operations simultaneously, such as loading data, creating indexes, and evaluating queries.

To achieve high performance, parallel database systems use multiple CPUs and drives at the same time. This allows them to process large amounts of data quickly and efficiently, making them an excellent choice for businesses and organizations that require fast and reliable database services.

Fig 2

How does a parallel database work?

A parallel database works in the five steps described below. First, here are some key benefits of using a parallel database system:

  1. Improved performance: Parallel databases can handle large volumes of data and perform multiple operations at once, resulting in faster and more efficient performance.
  2. Scalability: As your business grows and your database needs increase, you can easily scale up your parallel database system by adding more processors and storage drives.
  3. Enhanced reliability: Parallel databases are designed with fault tolerance in mind, meaning that they can continue to function even if one or more processors or drives fail.

Step 1: Data Partitioning

Parallel databases start by partitioning the data into smaller subsets. Each subset is stored on a separate node, which can be a separate CPU, a separate disk, or a combination of both. This allows the database system to work on multiple subsets of data simultaneously.
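
The partitioning step can be sketched as follows. This is a minimal illustration that assumes hash partitioning on a single key; the function name `partition_rows`, the `customer_id` column, and the choice of three nodes are hypothetical, and real systems may use range or round-robin partitioning instead.

```python
import zlib
from collections import defaultdict


def partition_rows(rows, key, num_nodes):
    """Assign each row to a node by hashing its partitioning key."""
    nodes = defaultdict(list)
    for row in rows:
        # crc32 gives the same result on every run, unlike hash() on strings
        node_id = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        nodes[node_id].append(row)
    return dict(nodes)


# Hypothetical sample data: eight customer rows spread over three nodes
rows = [{"customer_id": i, "amount": 10 * i} for i in range(1, 9)]
partitions = partition_rows(rows, "customer_id", num_nodes=3)

for node_id, subset in sorted(partitions.items()):
    print(node_id, [r["customer_id"] for r in subset])
```

Because the node is derived from the key alone, any later lookup for the same key can be sent straight to the node that holds it.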

Step 2: Parallel Processing

Once the data is partitioned, parallel processing can begin. The parallel database system uses multiple CPUs to execute queries, load data, and create indexes in parallel. This means that several operations can be performed at the same time, which results in faster and more efficient processing.
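
As a rough sketch of this step, the snippet below lets each partition be scanned by its own worker and then combines the partial results. Threads stand in for separate CPUs here, which is an assumption made only to keep the example short; the function names and sample rows are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition):
    """Local work for one node: filter its own rows and aggregate them."""
    return sum(row["amount"] for row in partition if row["amount"] > 20)


def parallel_total(partitions, max_workers=4):
    """Run the same scan on every partition at once, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_sums = list(pool.map(scan_partition, partitions))
    return sum(partial_sums)


# Hypothetical partitions, one list of rows per node
partitions = [
    [{"amount": 10}, {"amount": 50}],
    [{"amount": 30}, {"amount": 5}],
    [{"amount": 70}],
]
total = parallel_total(partitions)
print(total)
```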

Step 3: Load Balancing

To ensure that the workload is evenly distributed across all nodes, the parallel database system uses load balancing techniques. This means that the system monitors the workload on each node and distributes new tasks to the nodes with the least amount of work.
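
One simple way to picture this is a greedy least-loaded policy, sketched below. It assumes each task has a known cost, which real systems can only estimate; `assign_tasks` and the sample costs are hypothetical.

```python
import heapq


def assign_tasks(task_costs, num_nodes):
    """Give each new task to the node that currently has the least work."""
    heap = [(0, node_id) for node_id in range(num_nodes)]  # (load, node)
    heapq.heapify(heap)
    assignment = {node_id: [] for node_id in range(num_nodes)}
    for task_id, cost in enumerate(task_costs):
        load, node_id = heapq.heappop(heap)  # least-loaded node
        assignment[node_id].append(task_id)
        heapq.heappush(heap, (load + cost, node_id))
    loads = {node_id: load for load, node_id in heap}
    return assignment, loads


assignment, loads = assign_tasks([5, 3, 8, 2, 4, 6], num_nodes=3)
print(assignment)
print(loads)
```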

Step 4: Query Optimization

Parallel databases use advanced query optimization techniques to ensure that queries are executed efficiently. These techniques include parallel query execution, partition pruning, and predicate pushdown. By optimizing the queries, the system can process them more quickly and with less resource utilization.
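
Two of these techniques, partition pruning and predicate pushdown, can be illustrated with a small sketch. It assumes a sales table that is partitioned by year; the layout, the column names, and `query_total` are hypothetical.

```python
# Hypothetical layout: sales rows partitioned by year
partitions = {
    2021: [{"year": 2021, "region": "EU", "amount": 100}],
    2022: [
        {"year": 2022, "region": "EU", "amount": 150},
        {"year": 2022, "region": "US", "amount": 200},
    ],
    2023: [{"year": 2023, "region": "US", "amount": 300}],
}


def query_total(partitions, year, region):
    """Total sales for one year and region, touching as little data as possible."""
    scanned = []
    total = 0
    for part_year, rows in partitions.items():
        if part_year != year:
            continue  # partition pruning: skip the partition without reading it
        scanned.append(part_year)
        # predicate pushdown: the region filter runs where the rows are stored
        total += sum(r["amount"] for r in rows if r["region"] == region)
    return total, scanned


total, scanned = query_total(partitions, year=2022, region="US")
print(total, scanned)
```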

Step 5: Fault Tolerance

Parallel database systems are designed with fault tolerance in mind. This means that if one node fails, the system can continue to operate using the remaining nodes. To achieve fault tolerance, the system replicates data across multiple nodes and uses advanced algorithms to detect and recover from failures.
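
The idea of replicating data and falling back to a surviving copy can be sketched like this. It is a toy model under stated assumptions: two replicas per key, a simple placement rule, and a `Node` class invented for the example. Real systems also need failure detection and re-replication, which are not shown.

```python
class Node:
    """A toy storage node that can be switched off to simulate a failure."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


def replica_nodes(nodes, key, replicas=2):
    """Pick `replicas` consecutive nodes for a key (assumed placement rule)."""
    start = sum(key.encode("utf-8")) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def write(nodes, key, value, replicas=2):
    for node in replica_nodes(nodes, key, replicas):
        node.data[key] = value


def read(nodes, key, replicas=2):
    """Try each replica in turn and return the first answer."""
    for node in replica_nodes(nodes, key, replicas):
        try:
            return node.get(key)
        except ConnectionError:
            continue  # this replica failed, try the next one
    raise RuntimeError("all replicas unavailable")


nodes = [Node("n0"), Node("n1"), Node("n2")]
write(nodes, "order:42", "shipped")
replica_nodes(nodes, "order:42")[0].alive = False  # simulate a node failure
value = read(nodes, "order:42")
print(value)
```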

Fig 3

Performance evaluation:

When evaluating the performance of a parallel database, two important factors are considered: speed-up and scale-up. Response time is used to measure the effectiveness of a parallel database.

Response Time: Response time refers to the total amount of time it takes for a request to be completed.

Speed-up: Speed-up refers to completing a fixed task more quickly by increasing the resources available to it. It is commonly expressed as T1/Tn, where T1 is the time a single processor takes and Tn is the time n processors take for the same task.

Scale-up: Scale-up refers to the ability to maintain consistent performance when both the workload and the resources grow. This can be calculated using the equation Vn/V1, where Vn is the volume of work n processors complete in a given time and V1 is the volume a single processor completes in the same time.

For example, let’s say 10 users keep a single CPU at 100% utilization. If we add more users, response time will get worse because a single CPU can only handle a limited number of users. To bring response time back down, we can add processors to the system; in the ideal (linear) case, doubling the number of processors lets the system serve about twice as many users at the same response time.
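
A short worked calculation may help here. It assumes the common convention that speed-up is a ratio of elapsed times for the same task and that scale-up is a ratio of work volumes completed in the same time interval; the function names and all numbers are made up for illustration.

```python
def speed_up(time_one_processor, time_n_processors):
    """Same task, more processors: how many times faster does it finish?"""
    return time_one_processor / time_n_processors


def scale_up(volume_n_processors, volume_one_processor):
    """Same time interval, more processors: how much more work gets done?"""
    return volume_n_processors / volume_one_processor


# Hypothetical measurements for a 4-processor system
s = speed_up(time_one_processor=120.0, time_n_processors=30.0)
u = scale_up(volume_n_processors=3600, volume_one_processor=1000)

print(f"speed-up: {s:.1f}x (linear would be 4.0x)")
print(f"scale-up: {u:.1f}x (linear would be 4.0x)")
```

In this made-up case the speed-up is linear, while the scale-up falls a little short of linear, which is the more typical outcome once coordination overhead grows.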

Advantages of parallel databases:

Parallel databases offer a number of advantages, including:

  1. Speed: Parallel databases use a divide and conquer approach where a single request is split into multiple smaller requests, each of which is sent to a different computer for processing. This allows the system to process requests faster by spreading the workload across multiple machines. Once the requests are processed, the results are combined and returned to the user.
  2. Scalability: Parallel databases are highly scalable and can be easily expanded to handle an increasing number of requests. By adding more machines to the system, its processing capacity can be increased, allowing it to handle more requests simultaneously.
  3. Reliability: Parallel databases are designed to be highly reliable. If one machine in the system fails, the server can detect the failure and redirect the request to another machine in the cluster. This greatly reduces the likelihood of system failure, making the system more dependable.

Disadvantages of parallel databases:

Although parallel databases offer many benefits, they also have some disadvantages, such as:

  1. Cost: The implementation of parallel database systems can be expensive due to the need for multiple processors and disks to operate simultaneously. This can make it difficult for organizations with limited budgets to adopt this technology.
  2. Resource Management: The maintenance of parallel databases can be a complex process due to the constant need to update, modify, replace, or change resources. This requires a significant amount of time and resources, which can be challenging for organizations to manage.
  3. System Management: The management of parallel databases can be difficult and time-consuming. This includes managing resources, machines, processors, and disks, which can become more challenging as the number of systems in the cluster increases. Each system requires a reasonable amount of time to update if necessary, making it difficult to ensure that all systems are up-to-date and working properly.

Parallelism in databases can be broadly classified into two types:

  1. Intraquery Parallelism: Intraquery parallelism involves executing a single query in parallel on multiple processors and disks. This method is particularly useful for processing long-running queries more quickly. In this type of parallelism, the serial SQL query is broken down into smaller processes such as scan, join, sort, and aggregate, which are then executed concurrently.
  2. Interquery Parallelism: Interquery parallelism involves executing different queries or transactions concurrently. This method can increase transaction throughput, making it possible to support more transactions per second. The primary goal of interquery parallelism is to scale up transaction processing to improve overall system performance.
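
The difference between the two types can be sketched in a few lines. In this illustration, threads stand in for processors, the intraquery case splits one sort into per-partition sorts that are merged at the end, and the interquery case runs two unrelated queries side by side. All names and data are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of a single numeric column
partitions = [[7, 2, 9], [4, 1], [8, 3, 6, 5]]


def intraquery_sort(parts):
    """One query (a sort) split into per-partition sorts, then merged."""
    with ThreadPoolExecutor() as pool:
        sorted_runs = list(pool.map(sorted, parts))
    return list(heapq.merge(*sorted_runs))


def interquery_run(queries, parts):
    """Several independent queries, each executed whole, running concurrently."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, parts) for name, fn in queries.items()}
        return {name: future.result() for name, future in futures.items()}


queries = {
    "count": lambda parts: sum(len(p) for p in parts),
    "max": lambda parts: max(max(p) for p in parts),
}

ordered = intraquery_sort(partitions)
answers = interquery_run(queries, partitions)
print(ordered)
print(answers)
```

The first function shortens the response time of a single query, while the second raises the number of queries finished per unit of time.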

Architecture:

Parallel databases can have different architectures, each with its own advantages and limitations. The three primary architectures are shared disk, shared memory, and shared-nothing.

  • Shared Disk Architecture: In this architecture, multiple CPUs are connected to a network and share access to a common disk. Each CPU has its own memory and operating system, but they all access the same disk. This architecture allows for high data availability and efficient data sharing between CPUs.
Fig 4
  • Shared Memory Architecture: In shared memory architecture, multiple CPUs are connected through an interconnect and share access to a common global memory and disk arrays. This architecture enables a single RDBMS server to access all CPUs and memory, providing a consistent system image to the client. However, the shared memory and interconnect may become a bottleneck as the number of CPUs increases.
Fig 5
  • Shared-Nothing Architecture: In this architecture, each CPU has its own memory and disk storage, and there is no sharing of resources between them. The CPUs communicate only over an interconnection network, so no CPU can directly access another CPU’s memory or disk. This architecture provides high scalability, fault tolerance, and performance. However, it can be complex and expensive to maintain.
Fig 6

Applications:

Nowadays, applications require large databases that can reach hundreds of terabytes or petabytes, and parallel database systems are commonly used to address this need.

E-commerce applications require fast processing of many customer data requests, while data warehousing involves storing and manipulating vast amounts of critical information using parallel database systems. Additionally, parallelism is beneficial in data mining, where large datasets need to be processed efficiently.

Parallel databases are also utilized in online transaction processing (OLTP), where many concurrent transactions are performed on large databases, and online analytical processing (OLAP), which involves carrying out complex queries such as decision support questions.

Conclusion:

In conclusion, the development of parallel databases has transformed data storage and has revolutionized the way databases function. Major tech companies and enterprises have widely adopted parallel databases due to their effectiveness and notable speed advantages over traditional centralized databases.

Despite the few drawbacks, parallel databases have enormous potential to increase the number of operations and transactions performed. When we consider their numerous benefits, it becomes clear that parallel databases are a promising technology that will continue to play a significant role in data management and storage.

Distributed Databases

In today’s world, data is generated at an unprecedented rate, and organizations need to efficiently store, manage, and process it. This is where distributed databases come into play. Distributed databases are a network of databases that are spread across multiple physical locations but work together as a single unit. This allows for more efficient data management, higher availability, and better scalability.

Distributed databases can be categorized based on their architecture, including client-server architecture, peer-to-peer architecture, and multi-tier architecture. They also provide various benefits, such as increased fault tolerance, better performance, and reduced network traffic. However, implementing a distributed database can be complex and requires careful planning to ensure data consistency and security.

Fig 7

What is a distributed database?

Distributed databases are a type of database system in which data is stored across multiple computers that communicate and coordinate with each other to function as a single system. This allows for improved performance, scalability, and fault tolerance compared to traditional centralized databases.

  1. A distributed database consists of multiple nodes or computers that work together to store, manage, and access data.
  2. Each node in the system can access and modify the data that it is responsible for, and can also communicate with other nodes to perform distributed transactions and ensure data consistency.
  3. There are several different types of distributed database architectures, including client-server, peer-to-peer, and federated databases.
  4. Distributed databases are commonly used in large-scale web applications, social media platforms, e-commerce sites, and other systems that require high availability, scalability, and performance.
  5. Examples of popular distributed databases include Apache Cassandra, MongoDB, Redis, Amazon DynamoDB, and Google Cloud Spanner.

How does a distributed database work?

Here is a step-by-step guide to understanding how distributed databases work:

  1. Data fragmentation: When data is stored in a distributed database, it is fragmented into smaller portions and distributed among different nodes. This fragmentation ensures that data is easily accessible and also reduces data redundancy.
  2. Data replication: Another critical aspect of distributed databases is data replication. This process involves making copies of data across multiple nodes to ensure that data is available even if one or more nodes fail.
  3. Query processing: When a query is executed, it is sent to the node that holds the relevant data. The node processes the query and returns the result to the user. If the data is not available at the node, the query is forwarded to the appropriate node.
  4. Transaction processing: Transactions in a distributed database can be classified into two categories: local transactions and global transactions. Local transactions are confined to a single node, while global transactions span multiple nodes.
  5. Distributed concurrency control: Distributed concurrency control is necessary to ensure that multiple transactions do not conflict with each other. This process involves coordinating the activities of multiple nodes to ensure that transactions do not overwrite each other’s changes.
  6. Distributed recovery: Distributed recovery is the process of restoring the database to a consistent state after a failure. This process involves detecting and repairing failed nodes, restoring data from backup copies, and synchronizing data across multiple nodes.
Fig 8
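
The query-processing and transaction-processing steps above can be sketched with a toy model. It assumes every site can consult a shared catalog that records which site owns each key; the `Site` and `Cluster` classes, the site names, and the sample keys are all hypothetical.

```python
class Site:
    """One node of the distributed database with its local fragment."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # key -> value stored at this site


class Cluster:
    def __init__(self, sites):
        self.sites = {site.name: site for site in sites}
        # Catalog (assumed): which site is responsible for each key
        self.catalog = {key: site.name for site in sites for key in site.rows}

    def query(self, entry_site, key):
        """Answer locally if possible, otherwise forward to the owning site."""
        path = [entry_site]
        site = self.sites[entry_site]
        if key not in site.rows:
            owner = self.catalog[key]
            path.append(owner)  # the query is forwarded
            site = self.sites[owner]
        return site.rows[key], path

    def transaction_kind(self, keys):
        """A transaction is local if all its keys live on a single site."""
        owners = {self.catalog[key] for key in keys}
        return "local" if len(owners) == 1 else "global"


cluster = Cluster([
    Site("mumbai", {"c1": "Asha"}),
    Site("pune", {"c2": "Ravi"}),
])
result, path = cluster.query("mumbai", "c2")
print(result, path)
print(cluster.transaction_kind(["c1"]), cluster.transaction_kind(["c1", "c2"]))
```

A global transaction like the second one is the case that needs distributed concurrency control and recovery, since more than one site has to agree on the outcome.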

Types of distributed databases:

Distributed databases can be broadly classified into two types: homogeneous and heterogeneous. Homogeneous distributed databases consist of multiple databases that have the same DBMS and are managed by a central coordinating system. On the other hand, heterogeneous distributed databases consist of multiple databases that may have different DBMSs and are managed by a distributed coordination system.

Let’s understand the difference between the two with an analogy. Imagine a group of friends who all speak the same language and are managed by a single leader. This is similar to a homogeneous distributed database where all the databases are using the same DBMS and are managed by a central coordinating system. Now, imagine a group of friends who speak different languages and are managed by multiple leaders who communicate with each other to coordinate their actions. This is similar to a heterogeneous distributed database where the databases may be using different DBMSs and are managed by a distributed coordination system.

Now, let’s dive into the types of distributed databases within each of these categories:

Fig 9

Homogeneous Distributed Databases:

  1. Homogeneous Autonomous Distributed Database: This type of database consists of multiple databases that have the same DBMS and are managed by a central coordinating system. Each database can function independently and does not require communication with other databases to perform operations. Examples of homogeneous autonomous distributed databases include Oracle Real Application Clusters (RAC) and Microsoft SQL Server Cluster.
  2. Homogeneous Non-Autonomous Distributed Database: In this type of database, all the databases have the same DBMS and are managed by a central coordinating system. However, each database is not independent and requires communication with other databases to perform operations. An example of a homogeneous non-autonomous distributed database is IBM Db2 pureScale.

Heterogeneous Distributed Databases:

  1. Federated Distributed Database: This type of database consists of multiple databases that may have different DBMSs and are managed by a distributed coordination system. The coordination system communicates with each database to retrieve and manipulate data. Each database is still independent and has its own DBMS. An example of a federated distributed database is IBM’s Information Integrator.
  2. Multi-Database Distributed Database: In this type of database, multiple databases are combined into a single database system, allowing users to access data from all databases using a single interface. Each database may have its own DBMS and may be managed by a central coordinating system. An example of a multi-database distributed database is the OpenLink Virtuoso Universal Server.

Distributed data storage:

Distributed databases store data across multiple physical locations, making it necessary to develop effective strategies for managing data storage. Two common methods for managing distributed data storage are data fragmentation and data replication.

Fig 10

Data fragmentation is the process of dividing a table into smaller, more manageable pieces that are distributed across multiple locations. This can improve performance by reducing the amount of data that needs to be transferred across a network. There are three types of data fragmentation:

  1. Horizontal fragmentation: This involves dividing a table into multiple fragments based on rows. For example, a customer table could be fragmented by geographic location, with each fragment containing data for customers in a specific region.
  2. Vertical fragmentation: This involves dividing a table into multiple fragments based on columns. For example, a product table could be split so that one fragment holds the product name and price while another holds the long description, with each fragment also keeping the primary key so that rows can be rejoined.
  3. Hybrid fragmentation: This involves combining horizontal and vertical fragmentation. For example, a sales table could be fragmented horizontally by year, with each yearly fragment then fragmented vertically into groups of columns.
Fig 11
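
The fragmentation types can be sketched on a small in-memory table. This is only an illustration: the `customers` rows, the column groups, and the helper names are hypothetical, and the vertical fragments repeat the `id` column so that the original rows can be rebuilt.

```python
# Hypothetical customer table
customers = [
    {"id": 1, "name": "Asha", "region": "EU", "email": "asha@example.com"},
    {"id": 2, "name": "Ben", "region": "US", "email": "ben@example.com"},
    {"id": 3, "name": "Chen", "region": "EU", "email": "chen@example.com"},
]


def horizontal_fragments(rows, column):
    """Split by rows: every fragment keeps all columns for some of the rows."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_fragments(rows, key, column_groups):
    """Split by columns: every fragment keeps the key plus one column group."""
    return [
        [{col: row[col] for col in [key, *cols]} for row in rows]
        for cols in column_groups
    ]


def rejoin_vertical(fragments, key):
    """Rebuild full rows by matching the fragments on the shared key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


by_region = horizontal_fragments(customers, "region")
profile, contact = vertical_fragments(customers, "id", [["name", "region"], ["email"]])
restored = rejoin_vertical([profile, contact], "id")
print(sorted(by_region), profile[0], contact[0])
```

Hybrid fragmentation would simply apply `vertical_fragments` to each list returned by `horizontal_fragments`.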

Data replication, on the other hand, involves creating multiple copies of data and distributing them across multiple locations. This can improve performance by allowing users to access data from the nearest location. There are three types of data replication:

  1. Transactional replication: This involves copying changes to a table from one location to another as they occur. For example, if a customer places an order, the order information is replicated to the other locations in near real time.
  2. Snapshot replication: This involves copying an entire table to another location on a regular schedule. For example, a product catalog might be replicated every night to ensure that all locations have up-to-date information.
  3. Merge replication: This involves copying changes to a table from multiple locations to a central location, where the changes are merged into a single table. For example, a sales table could be updated at different locations and then merged into a single table at the end of the day.
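
The three replication types can be contrasted with a toy model. Everything here is a simplified sketch: the class and function names are hypothetical, and the merge step assumes a last-writer-wins rule based on a timestamp, which is only one of several possible ways to resolve conflicts.

```python
import copy


class Replica:
    """A copy of a table held at another location."""

    def __init__(self):
        self.table = {}


def snapshot_replicate(primary_table, replica):
    """Snapshot: copy the entire table in one go, e.g. on a nightly schedule."""
    replica.table = copy.deepcopy(primary_table)


class TransactionalPublisher:
    """Transactional: forward every change to the replicas as it occurs."""

    def __init__(self, replicas):
        self.table = {}
        self.replicas = replicas

    def write(self, key, value):
        self.table[key] = value
        for replica in self.replicas:
            replica.table[key] = value


def merge_replicate(central_table, site_changes):
    """Merge: combine changes from many sites (assumed rule: latest timestamp wins)."""
    for _, key, value in sorted(site_changes, key=lambda change: change[0]):
        central_table[key] = value
    return central_table


nightly = Replica()
snapshot_replicate({"p1": "Laptop"}, nightly)

live = Replica()
publisher = TransactionalPublisher([live])
publisher.write("o1", "placed")

# (timestamp, key, value) tuples collected from different sites
central = merge_replicate({}, [(2, "s1", 120), (1, "s1", 100), (3, "s2", 80)])
print(nightly.table, live.table, central)
```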

Applications of distributed databases:

  1. Social Media: Social media platforms such as Facebook and Twitter use distributed databases to store user data such as posts, likes, comments, and messages. This allows for quick and efficient access to user data, even during high traffic periods.
  2. E-commerce: Online marketplaces like Amazon and eBay use distributed databases to store product information, customer data, and transactional data. This enables these platforms to handle a large number of transactions simultaneously, ensuring seamless shopping experiences for their customers.
  3. Financial Services: Banks and financial institutions use distributed databases to store and process large volumes of data related to customer accounts, transactions, and credit histories. This helps them provide efficient services to their customers while maintaining the security and integrity of their data.
  4. Healthcare: Healthcare organizations use distributed databases to store and manage electronic health records (EHRs) of patients. This allows healthcare providers to access and share patient data easily and securely, leading to better patient care and outcomes.
  5. Logistics and Supply Chain Management: Companies in the logistics and supply chain industry use distributed databases to store and manage data related to inventory, shipping, and delivery. This helps them optimize their operations and provide better services to their customers.
  6. IoT and Smart Cities: The Internet of Things (IoT) and smart cities rely heavily on distributed databases to store and process data generated by sensors, devices, and machines. This helps in monitoring and managing various aspects of city infrastructure such as traffic, energy usage, and waste management.

In short, distributed databases are widely used in various industries that require the storage, management, and processing of large volumes of data. They offer a robust and scalable solution for handling complex data requirements of modern applications.

Conclusion:

  • Distributed databases allow data to be stored across multiple physical locations while being accessed as a single system.
  • They provide high availability, fault tolerance, and scalability.
  • Homogeneous distributed databases use the same DBMS at every site, while heterogeneous distributed databases may use different DBMSs.
  • Data fragmentation and replication are techniques used in distributed databases to improve performance and data availability.
  • Horizontal, vertical, and hybrid fragmentation are types of data fragmentation while transactional, snapshot, and merge replication are types of data replication.
  • Distributed databases are used in various applications such as e-commerce, social media, and banking systems.

Authors:

  • Pratham Adav
  • Ajay Gonepuri
  • Dhruv Kshirasagar
  • Sanket Disale
  • Vishal Gavali
