Parallel Query Processing and Optimization in DBMS

Sarang S. Babu
Plumbers Of Data Science
8 min read · Nov 30, 2022


Query processing is the process through which a Database Management System (DBMS) parses, verifies, and optimizes a given query before creating low-level code that the DB understands.

Like any other high-level language (HLL), where code is first compiled and then executed, query processing in a DBMS has two phases: compile time and runtime.

The use of declarative query languages and query optimization is one of the main factors contributing to the success of RDBMS technology. Any database allows users to write queries requesting specific data, and the database then uses efficient methods to locate the requested data.

The task of the query optimizer is to assess all possible ways to execute a query and choose the most effective one. This is done during the optimization process.

The development of advanced database systems has coincided with notable advances in parallel processing and distributed computing technology. Parallel database systems emerged from the combination of these two developments.

Data management and parallel processing techniques are combined in parallel database systems to offer scalability, high performance, and high availability for data-intensive applications.

Subject Matter

Parallel query processing in database systems has been the subject of extensive research over the past two decades, and its results are now used in major commercial DBMSs. Most of this research focused on shared-nothing (SN) architectures, as in the research prototypes Gamma and Bubba.

The shared-nothing approach is well suited to today's grid and cluster computing, while symmetric multiprocessing (SMP) architectures were the focus of strategies such as XPRS and Volcano.

Numerous other researchers have studied database optimization approaches based on chip multiprocessors (CMPs), but the majority of that work focused on optimizing join operations while taking into account the L2 cache and the parallel buffers of shared main memory.

By dividing a query into portions that can run in parallel, query parallelism lets us execute numerous queries simultaneously. A shared-nothing architecture can help with this.

Parallelism is also employed to speed up query execution as more resources, such as processors and disks, become available.

The following techniques can be used to parallelize a query:

  • I/O parallelism
  • Intra-query parallelism
  • Inter-query parallelism
  • Intra-operation parallelism
  • Inter-operation parallelism

I/O parallelism

This type of parallelism partitions relations across the disks in order to speed up their retrieval.

The input data is divided into partitions, and each partition is processed simultaneously; afterwards, the results are combined. This is also known as data partitioning.

Hash partitioning is best suited for point queries on the partitioning attribute and has the benefit of spreading data evenly across the disks.

It should be noted that partitioning benefits sequential scans of a full table stored across "n" disks: the scan takes roughly 1/n of the time it would need on a single-disk system. In I/O parallelism, there are four different methods of partitioning:

Hash partitioning

A hash function is a fast mathematical function. For each row of the original relation, the partitioning attributes are hashed, and the hash value determines which disk stores the row.

Suppose the data is to be partitioned across four disks, numbered disk1, disk2, disk3, and disk4. A row is stored on disk3 if the hash function returns 3.
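
As a rough illustration of the idea, here is a minimal Python sketch (not any particular DBMS's implementation; the rows and the number of disks are made up for the example):

    NUM_DISKS = 4

    # Each "disk" is just a list here; in a real system these would be
    # separate storage devices or nodes.
    disks = [[] for _ in range(NUM_DISKS)]

    rows = [(101, "sedan"), (102, "truck"), (103, "coupe"), (104, "van")]

    for row in rows:
        key = row[0]  # the partitioning attribute
        disks[hash(key) % NUM_DISKS].append(row)  # the hash picks the disk

    for i, disk in enumerate(disks):
        print(f"disk{i}:", disk)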

Range partitioning

With range partitioning, each disk receives a contiguous range of attribute values. For instance, when range-partitioning across three disks numbered 0, 1, and 2, we might send tuples with a value below 5 to disk0, values from 5 to 40 to disk1, and values above 40 to disk2.

It has several benefits, such as placing tuples whose attribute values fall within a specified range together on the same disk, which lets range queries read only the relevant disks.
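
A minimal Python sketch of the same three-disk example (the boundary values 5 and 40 come from the example above; everything else is illustrative):

    # (<5) -> disk0, (5..40) -> disk1, (>40) -> disk2
    def range_partition(value):
        if value < 5:
            return 0
        elif value <= 40:
            return 1
        return 2

    disks = [[] for _ in range(3)]
    for value in [3, 17, 42, 5, 99]:
        disks[range_partition(value)].append(value)

    print(disks)  # [[3], [17, 5], [42, 99]]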

Round-robin partitioning

In this method, the relation is read in any order; the ith tuple is sent to disk number i % n.

Disks therefore receive new rows in turn. For applications that read the entire relation sequentially for each query, this strategy ensures an even distribution of tuples across the disks.
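
A minimal Python sketch of the i % n rule (the tuples are placeholders):

    NUM_DISKS = 3
    disks = [[] for _ in range(NUM_DISKS)]

    # The i-th tuple goes to disk i % n, so the disks fill up in turn.
    for i, tup in enumerate(["t0", "t1", "t2", "t3", "t4"]):
        disks[i % NUM_DISKS].append(tup)

    print(disks)  # [['t0', 't3'], ['t1', 't4'], ['t2']]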

Schema partitioning

In schema partitioning, different tables within a database are placed on different disks.

Intra-query parallelism

Intra-query parallelism refers to processing a single query in parallel on multiple CPUs, using a shared-nothing parallel architecture. It employs two different strategies:

First method: each CPU executes a copy of the same task, each working on a different portion of the data.

Second method: the task is broken into subtasks, with each CPU carrying out a different subtask.
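
As a minimal sketch of the first method (an in-memory list stands in for a partitioned relation; nothing here comes from a real DBMS), each worker process runs the same selection task on its own partition:

    from multiprocessing import Pool

    def scan_partition(partition):
        # Every worker runs the same task (a selection) on its own data.
        return [x for x in partition if x % 2 == 0]

    if __name__ == "__main__":
        data = list(range(100))
        partitions = [data[i::4] for i in range(4)]  # one partition per CPU
        with Pool(processes=4) as pool:
            results = pool.map(scan_partition, partitions)
        # Combine the partial results, as the DBMS merges partial scans.
        print(sorted(sum(results, [])))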

Inter-query parallelism

When inter-query parallelism is used, each CPU executes numerous transactions; this is known as parallel transaction processing. To support inter-query parallelism, the DBMS leverages transaction dispatching.

A variety of supporting techniques can also be employed, such as effective lock management; running each query sequentially instead would slow down the overall running time.

In such circumstances, the DBMS must be aware of the locks acquired by the various transactions running in different processes. Inter-query parallelism on a shared-storage architecture works well when concurrent transactions do not access the same data.

It also boosts transaction throughput, and it is the simplest form of parallelism in a DBMS.
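
As a rough sketch of the idea (Python threads and toy functions standing in for transactions; a real DBMS dispatches transactions to sessions, not futures):

    from concurrent.futures import ThreadPoolExecutor

    # Stand-ins for two independent queries from different sessions.
    def query_a():
        return sum(range(1000))  # e.g. an aggregate over one table

    def query_b():
        return max(range(1000))  # e.g. a lookup on another table

    with ThreadPoolExecutor(max_workers=2) as pool:
        fa = pool.submit(query_a)  # both queries run concurrently; they
        fb = pool.submit(query_b)  # touch disjoint data, so no lock
        print(fa.result(), fb.result())  # conflicts arise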

Intra-operation parallelism

In this type of parallelism, we execute a single operation of a task, such as a sort, join, or projection, in parallel across multiple processors. Intra-operation parallelism can reach a very high degree of parallelism.

Database systems naturally employ this kind of parallelism. Consider the following SQL example:

SELECT * FROM vehicles ORDER BY model_number;

The relational operation in the above query is a sort, and the relation may contain a large number of records.

Because the sort can be performed on distinct subsets of the relation on several processors, it takes less time to sort the data.
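
A minimal sketch of such a parallel sort (an in-memory toy; real systems use range-partitioned external sorts): each worker sorts its own subset, and the sorted runs are then merged.

    import heapq
    from multiprocessing import Pool

    def sort_partition(partition):
        return sorted(partition)  # each processor sorts its own subset

    if __name__ == "__main__":
        records = [42, 7, 99, 3, 18, 64, 1, 55]
        partitions = [records[0::2], records[1::2]]  # two subsets
        with Pool(processes=2) as pool:
            runs = pool.map(sort_partition, partitions)
        # Merge the sorted runs to finish the parallel sort.
        print(list(heapq.merge(*runs)))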

Inter-operation parallelism

This term refers to the concurrent execution of different operations within a single query expression. It comes in two varieties:

Pipelined parallelism: a second operation consumes rows of the first operation's output before the first operation has finished producing its full set of output rows.

It is also feasible to run these two operations concurrently on separate CPUs, so that one operation consumes tuples at the same time as the other produces them.

This is advantageous for systems with a limited number of CPUs, and it avoids writing intermediate results to disk.
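
A minimal sketch of a two-stage pipeline (Python threads and a bounded queue standing in for DBMS operators): the projection starts consuming rows while the scan is still producing them.

    import queue
    import threading

    q = queue.Queue(maxsize=10)
    DONE = object()  # sentinel marking the end of the scan's output

    def scan():  # first operation: produces tuples
        for i in range(5):
            q.put(("row", i))
        q.put(DONE)

    def project():  # second operation: consumes rows as they arrive,
        while True:  # without waiting for the scan to finish
            item = q.get()
            if item is DONE:
                break
            print("projected:", item[1])

    producer = threading.Thread(target=scan)
    consumer = threading.Thread(target=project)
    producer.start(); consumer.start()
    producer.join(); consumer.join()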

Independent parallelism: operations within a query expression that do not depend on one another can be carried out concurrently. This approach is most helpful when only a low degree of parallelism is available.
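
A minimal sketch of independent parallelism (toy scans standing in for the two independent inputs of a join):

    from concurrent.futures import ThreadPoolExecutor

    # Two independent subexpressions of one query, e.g. the scans feeding
    # a join: neither needs the other's output, so they run in parallel.
    def scan_r():
        return {1: "a", 2: "b"}

    def scan_s():
        return {2: "x", 3: "y"}

    with ThreadPoolExecutor(max_workers=2) as pool:
        fr, fs = pool.submit(scan_r), pool.submit(scan_s)
        r, s = fr.result(), fs.result()

    # The join itself runs once both independent inputs are ready.
    print({k: (r[k], s[k]) for k in r.keys() & s.keys()})  # {2: ('b', 'x')}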

Execution of a Parallel Query

Commercial database technology has favoured the relational model over the earlier hierarchical and network models. The key advantages of relational database systems (RDBMSs) over their forerunners are data independence and high-level query languages (e.g., SQL).

These raise programmer productivity and encourage automatic optimization.

Additionally, distributed database management is made easier by the relational model’s set-oriented structure. RDBMSs may now offer performance levels comparable to older systems thanks to a decade of development and tuning.

They are therefore widely employed in commercial data processing for OLTP (online transaction processing) and decision-support systems. Parallel processing uses multiprocessor computers, with many processors working together, to run application programmes and boost performance.

It has been most commonly used in scientific computing, where it improves the response speed of numerical applications.

The development of parallel database systems is an example of how database management and parallel computing can work together. Through parallel query optimization (PQO), a parallel database system can divide a given SQL statement so that its components run concurrently on several processors in a multi-processor machine.

Full table scans, sorting, sub-queries, data loading, and other common operations can all be performed in parallel.

As a form of parallel database optimization, Parallel Query enables the division of SELECT or DML operations into many smaller chunks that can be executed by PQ slaves on different CPUs in a single box.

In the first phase of optimization, sorting and rewriting, the order of joins and the method for computing each join are fixed. The second phase, parallelization, turns the query tree into a parallel plan; this stage has two parts: extraction of parallelism and scheduling.

Query Processing and Optimization Approach in Parallel Databases

Parallel processing breaks a huge task down into numerous smaller tasks and runs each of them on different nodes and processors at the same time.

The larger task thus gets finished faster. In sequential processing, by contrast, separate jobs compete for the same resource: only task 1 can start immediately, task 2 must wait for task 1 to finish, task 3 must wait for task 2, and so on.

During parallel processing, a larger share of CPU capacity is allocated to the jobs. There is no waiting involved, because each autonomous task starts working right away on its own processor.

Concurrency management, task synchronisation, resource sharing, data placement, and network scaling are qualities that a parallel database system must provide. Synchronisation is a vital component of success.

In a shared-nothing or loosely coupled design, locking mechanisms are primarily used to synchronise the database; the same approach is taken for data placement in a tightly coupled architecture.

The load balancing of a parallel database system depends on the placement of the data. Ideally, each parallel process should operate on a separate dataset to prevent interference between them.

These independent datasets can be created by declustering (horizontally partitioning) the relations using a hash function or a range index and assigning each partition to a distinct memory module. The suggested execution plan is as follows:

The first step is to minimise the total amount of work required to evaluate the query.

Then, distribute that work as evenly as possible across the available processors.
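
One simple way to approximate the second step is a greedy longest-processing-time heuristic (a sketch with made-up partition costs, not a method from the article): repeatedly hand the largest remaining chunk of work to the least-loaded processor.

    import heapq

    partition_costs = [9, 7, 6, 5, 4, 3]  # made-up work per partition
    loads = [(0, p) for p in range(3)]  # (current load, processor id)
    heapq.heapify(loads)
    assignment = {p: [] for p in range(3)}

    # Largest chunk first, always onto the least-loaded processor.
    for cost in sorted(partition_costs, reverse=True):
        load, p = heapq.heappop(loads)
        assignment[p].append(cost)
        heapq.heappush(loads, (load + cost, p))

    print(assignment)  # loads end up roughly balanced: 12, 11, 11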

Conclusion

  • In recent years, parallel DBMSs have become a reality.
  • They offer centralized DBMS capabilities in a multiprocessor system.
  • The performance expectations of a range of critical applications that place high throughput demands on the DBMS can realistically be met only by parallel DBMSs.
  • To achieve this, parallel DBMSs must be built with particular regard for the protocols and techniques involved.
