An Introduction to Parallel Computing

“Parallel computing” stands for the ability of computer systems to perform multiple operations simultaneously. The main driver behind parallel computing is the fact that large problems can be divided to smaller ones which can be then solved in parallel — i.e. executed concurrently on the available computing resources.

Introduction

The quest for squeezing more performance per silicon has been going on since the dawn of computing; from the deep execution pipelines in processors in the early 2000s, from the multicore SoCs found in mobile phones of today, to the General Purpose Graphics Processing Units that will power AI applications in the future. Parallel computing can increase performance, decrease wall-clock execution time, decrease performance-per-watt and save energy and resources.

Parallelism can be achieved in different forms, each bounded by the technology limits of its era. Computer engineers have been trying to utilize as much parallelism as possible directly on hardware since the early 1970s. This approach helped the parallelism process to remain transparent to the software developers, thus allowing them to follow the same programming paradigms for decades while the performance kept improving.

However, physical limitations on silicon CMOS technology are preventing further relying on low-level hardware-based parallelism techniques. Scaling down transistor size is top priority for chipmakers; but new challenges arise as the CMOS devices reach deep in the nanoscale regions:

Process variation — the natural random variation of transistor attributes in manufacturing — becomes extremely important at smaller scales (<65nm CMOS technology). These variations can be now a major percentage of the intended transistor features, thus altering the intended transistor’s attributes. Thus, process variation can cause variation in output performance of the CMOS devices.

Reliability issues arise as well. Nanoscale CMOS devices have increasingly more chances to fail, as the downsizing of transistor features — length, width, oxide thickness — and the attempts to drive down operational voltage both negatively affect their predicted lifetime. Naturally occurring phenomena that lead to failure are well-known since the 1960s; however their effects become more and more prominent as CMOS technology scales down. Chipmakers and researchers are now studying these phenomena and how we can predict and improve the expected lifetime for any given CMOS device.

Taking the above into consideration, application developers should now turn to the multi-threading parallel programming paradigm and new emerging computing technologies to achieve higher performance and lower power consumption for their application needs.

Types of Parallelism

Bit-level Parallelism: This form of parallelism is based on doubling the processor word size. Increased bit-level parallelism means faster execution of arithmetic operations for large numbers. For example, an 8-bit processor needs two cycles to execute a 16-bit addition while a 16-bit processor just one cycle. This level of parallelism seems to have come to an end with the introduction of 64-bit processors.

Instruction-level parallelism (ILP):This type of parallelism is trying to exploit the potential overlap between instructions in a computer program. Most forms of ILP are implemented and realized on each processor’s hardware:

Instruction Pipelining —Execute different stages of different independent Instructions in the same cycle thus making full use of idle resources.

Superscalar Processors — Utilize multiple execution units in a single processor die, thus following instructions can be executed without waiting on complex preceding instructions to finish executing.

Out-of-order execution — Instructions not violating any data dependencies are executed when a unit is available even when preceding instructions are still executed.

Speculative execution / Branch prediction — The processor tries to avoid stalling while control instructions (branches — i.e. if or case commands) are still resolved by executing the most probable flow of the program.

Most processors use a combination of the above ILP techniques. For example, Intel Pentium series used all of the above techniques to achieve higher performance with each product generation. Static ILP parallelism on the other hand, can be achieved on software level by specialized compilers which are used by the specialized Very long instruction word (VLIW) processors. VLIW processors have multiple execution units organized in multiple pipelines and require the compilers to prepare the parallel instruction streams to fully utilize their resources.

Task / Thread-level parallelism:This high-level form of parallel computing is focusing on partitioning the application to be executed in distinct tasks or threads, that can be then executed simultaneously on different computation units. Threads can work on independent data fragments or even share data between them.

The developer has to use the multi-threaded programming paradigm in order to fully utilize the capabilities of the nowadays available multicore processors. This new programming paradigm requires a shift of mentality while programming new applications, which is somewhat difficult, as until recently programming was done sequentially (a single-thread holding the whole application).

Modern Operating Systems have been used to provide a transparent utilization of all of the available cores of modern multicore processors by scheduling different processes on different cores. However, this is not the case for executing efficiently complex bioinformatics applications, because the computation load of each process cannot be distributed on the available cores by the OS. As such, these applications must be re-developed with multi-threading in mind in order to achieve higher level of parallelism though thread-level parallelism and improve their performance.

Data parallelism:A form of high-level parallelism that in contrast to thread-level parallelism tries to partition the data to different available computation units instead. The cores execute the same task code over the data assigned to each. This form requires advanced developing skills to achieve, as the code executed should be executed independently on each data fragment. In addition, it is applicable to specific problems only.

Data parallelism is the only available option for high-level parallelism in computer graphics because the graphic processor units (GPUs) are designed to execute each graphic processing task as fast as possible by partitioning each frame in regions. The task on command is then executed independently on each data region by their hundreds of processing units.

Limitations of Parallel Computing

Computer engineers keep pushing the limits of available parallelism by designing architectures with more resources and more processing cores — there are plans to build CPUs with 1000 cores; however, no actual benefit will be yielded even in the case that a 1000-core processor is available tomorrow. Moving to hundreds and thousands of cores requires a radical rethinking in how software is designed. Software is the actual roadblock that limits the full utilization of the available computing resources of today as well.

The industry has moved towards a clear thread-level parallelism model since the early 2000s, however the existence of legacy software and the lack of parallel programming training for developers have not allowed for high-end computation systems to fully embrace the available parallelism. Algorithms used today have mostly been designed and implemented while using the traditional sequential way-of-thought. As such, these implementations fail — in most cases — to achieve the best performance possible on the available state-of-the-art computing technologies. Attempts to parallelize these algorithms are bounded by Amdahl’s law.

Amdahl’s law: Amdahl’s law is actually modelling the expected improvement of parallelized implementations of an already existing algorithm implementation. The execution time of the parallel version is limited by the percentage of the sequential (non-parallelized) code. For example, if 90% of an algorithm is parallelizable, the overall improvement can be no more than 10x. An assumption made by Amdahl’s law is that the problem size remains the same when the problem is parallelized. However, this is not the norm and Amdhal’s law just describes the best-case scenario. Splitting data for processing between many different processing units, results in I/O and memory overheads. These overheads negatively affect overall performance.

I/O and Memory Issues

It is well known that an optimal assignment of threads to processing units is NP-hard, even when threads compete for a single processing unit, and the demands for this resource are known a priori. In practice, the assignment problem is exaggerated by the unpredictable and dispersive nature of the resource requirements of the threads, as well as by memory, I/O or inter-process communication between threads of the same process. Reducing the I/O and memory overheads as well as the communication delays between the threads is an open research problem. However, there is no optimal solution that can minimize the aforementioned overheads for all problems. Each application needs to be extensively profiled and an appropriate memory management and a resource allocation scheme need to be defined.

Memory can be shared between threads — i.e. all threads access the same data — or distributed — i.e. all memory is local for each thread and sharing is done by transferring data between units or even a mixture of both. Shared memory offers a unified address space; however is prone to race conditions for multiple processing units request/access the same chunk of data. Distributed memory excludes race conditions, but forces the developer to think about data distribution and how the units need to communicate between them. A hybrid distributed/shared solution allows for scalable designs; however despite hiding the mechanism of data communication, it does not hide the delays. The design and implementation of a resource allocation scheme for a particular computation problem is heavily impacted by the targeted computation engine’s architecture and memory scheme as well. Options and designs are different for multicore processors, clusters and dedicated hardware implementations.

Final Remarks

Using new emerging computation platforms such as GPUs and FPGAs can offer hundreds of processing cores for executing inherently computationally intensive applications. However, limitations do exist. Extreme parallelization of applications means that many processing cores will need data to work on. Thus, such applications are been transformed into memory intensive due to the use of parallelism. GPUs have limited amount of memory per core and FPGAs have limited amount of on-chip memory. As such, memory becomes critical to the parallelized applications. Developers must rely on external memories such as DRAMs. However, these large external memory systems still lag behind in feeding the hundreds of available processing units in time, as the existing design practices (DRAM DIMMs for example) limit the available data bandwidth. For the aforementioned reasons, memory and I/O transferring rates pose a limit on performance gains when mapping applications on these now-emerging computation platforms.