Parallel Programming: What Is It?

A Technical Overview of Parallel Programming.

Carmine De Fusco
The Blog of a Computer Scientist
Aug 29, 2019 · 14 min read


Multicore processors are now the de facto standard in processor architectures around the world, from PCs to smartphones, so writing code that can exploit parallel computing has become essential.

This is a technical discussion of parallel (and concurrent) programming that answers various questions and addresses the key topics that anyone involved in designing and developing multicore, multithreaded software should know.

What are the main reasons for using parallel computing?

The main reason is to save time and money: more resources working simultaneously solve problems faster. Parallelization, and therefore parallel computing, allows data to be processed in parallel instead of with the classic sequential approach. The economic advantage comes mainly from the fact that even cheaper computers, once connected in parallel, can solve complex problems efficiently. Processing large amounts of data (Big Data) is another of the main reasons why parallel computing is preferable and more convenient.

Flynn’s taxonomy.

Flynn's taxonomy classifies parallel computers along two dimensions, Instruction and Data, and each of these dimensions can take two values, Single and Multiple. The following table shows the four possible classifications according to Flynn's taxonomy.

Flynn's taxonomy:

                       Single Data    Multiple Data
Single Instruction     SISD           SIMD
Multiple Instruction   MISD           MIMD

SISD (Single Instruction, Single Data): This is what we find in a classic computer that processes data sequentially rather than in parallel. During each clock cycle, only one instruction is executed on one data stream. It is the classic architecture found on common non-parallel PCs.

SIMD (Single Instruction, Multiple Data): A classic example of a parallel computer. In each clock cycle all processing units execute the same instruction, with the difference that each one processes a different set of data. It is a common architecture used, for example, in GPUs (Graphics Processing Units).

MISD (Multiple Instruction, Single Data): A single set of data is processed by different processing units, each of which executes its own instructions independently of the others. There are not many tangible examples of this type of architecture. One use could be running different cryptographic algorithms (multiple instruction) to decipher a single message (single data).

MIMD (Multiple Instruction, Multiple Data): Each processing unit can execute a different stream of instructions on a different set of data. This is the most common architecture for parallel computers, and many MIMD architectures contain SIMD subcomponents. Examples include networked clusters and grids of parallel computers, SMP multiprocessor machines, and multi-core PCs.

What is the granularity of a parallel program?

It is a qualitative measure of the ratio of computation to communication. Two types of granularity can be distinguished:

Fine-Grain: Fine-grained parallelism. Small amounts of computation are carried out between communication events, so the computation-to-communication ratio is low. This facilitates load balancing, but at the cost of higher overhead and less room for performance gains. Granularity that is too fine should be avoided, because the communication time, and therefore the overhead derived from it, can exceed the actual computation time.

Coarse-Grain: Coarse-grained parallelism. Large amounts of computation are carried out between communication events, so the computation-to-communication ratio is high. This makes it easier to increase performance, but harder to achieve good load balancing.

What is the overhead of a parallel program, and what generates it?

Overhead is simply the time required to manage communications or, more generally, to coordinate parallel activities. We want this time not to be excessive, as it is time taken away from useful work. Overhead can be generated by various factors, such as the following:

Task start and termination time: The time it takes to initialize all tasks before they start doing useful work, and the time it takes to terminate them once their work is done.

Synchronization time: The time taken to synchronize the various tasks that run in parallel and concurrently with each other.

Communication time: The time tasks spend communicating with one another to exchange data and/or messages.

Software and hardware latency: Latency caused by the software, for example the operating system, or by the hardware, for example waiting for input from a peripheral device.

Shared Memory architectures (UMA and NUMA) with advantages and disadvantages

In order to communicate, processes can use a shared memory space, sometimes called Shared Memory. Every time a process writes into the shared memory, the changes are visible to the other processes that share the same memory area. Architectures that provide shared memory can be divided into two types:

UMA (Uniform Memory Access): Represented by SMP (Symmetric MultiProcessing) machines. The processors are homogeneous and have the same memory access policy. Moreover, in most implementations the memory is cache coherent, i.e., a change made by one processor in the shared memory is propagated directly to the others.

NUMA (Non-Uniform Memory Access): Implemented by physically connecting two or more SMPs. The access time to memory is not the same for all processors; cache coherence can still be maintained in this case.

The advantages and disadvantages of using Shared Memory architectures (UMA and NUMA) are described below:

Advantages: The global address space and the fast, uniform sharing of data between tasks make programming on this type of architecture easier.

Disadvantages: Mainly the lack of scalability between memory and CPUs. As the number of CPUs grows, traffic on the memory path grows with it, and synchronizing parallel, concurrent access to shared memory introduces considerable overhead.

Distributed Memory architectures with advantages and disadvantages

Instead of a bus, a network is used to interconnect the memories of the different CPUs. There is no global address space, and each processor has its own local memory. Each processor therefore operates on its own memory independently of the others, and changes to it do not need to be visible to the others. In other words, there is no need to maintain cache coherency. If a processor needs to access the memory of another processor, it is the programmer's job to specify how this must happen, so it is up to the programmer to define and implement synchronization. The network can be of various types; for example, it could simply be a classic Ethernet network.

Advantages: Memory scales with the number of processors: increasing the number of processors increases the total memory proportionally. Each processor accesses its own memory quickly, without the overhead of maintaining cache coherence.

Disadvantages: The programmer is responsible for communication and for any synchronization between processes. Memory access is not uniform, and it can be difficult to map existing data structures onto distributed memories.

Hybrid architectures

Hybrid architectures combine the advantages and disadvantages of the architectures described above. The shared memory component can be a cache-coherent SMP machine and/or a graphics processing unit (GPU). Processors on the same SMP can address its memory as global. The distributed component is the network layer that allows data to be communicated and moved from one SMP to another.

What is the shared memory programming model?

In the shared memory parallel programming model, tasks share a memory space in which they can write and/or read asynchronously. Various mechanisms, such as locks and semaphores, can be used to manage access to the shared memory. Synchronization mechanisms can be provided by the underlying hardware or implemented directly by the programmer.
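
As a minimal illustrative sketch of this model (not taken from the original article), the following C fragment uses POSIX threads with a mutex as the lock coordinating asynchronous updates to a shared variable; names such as shared_total and worker are invented for the example.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared memory: all threads see and update the same variable. */
static long shared_total = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long contribution = (long)arg;

    pthread_mutex_lock(&lock);    /* acquire the lock before touching shared data */
    shared_total += contribution;
    pthread_mutex_unlock(&lock);  /* release it so other tasks can proceed */
    return NULL;
}

int main(void)
{
    pthread_t threads[4];

    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);

    printf("shared_total = %ld\n", shared_total);  /* 1 + 2 + 3 + 4 = 10 */
    return 0;
}
```

Compiled, for example, with gcc -pthread; without the lock, the concurrent updates to shared_total would be a data race.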

What is the multithreading programming model?

In the multithreaded parallel programming model, a single process can have multiple execution paths at the same time. This can be seen as a single program (the process) that executes several subprograms (threads) in parallel and concurrently. Each thread has its own local data as well as global data shared with all the other threads spawned by the same process, and this shared global memory makes communication between threads easy. Here too, synchronization mechanisms are needed, such as locks, semaphores, or monitors, to regulate access to the global memory shared among the threads of the same process. How multithreaded code is implemented depends both on the programming language and on the threading library or runtime support available (for example, POSIX Threads or OpenMP).
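
The following sketch, again with POSIX threads and invented names (shared_data, partial_sum), illustrates the split between per-thread local data and process-wide global data: each thread sums its own half of a shared array into its own slot, so no lock is needed.

```c
#include <pthread.h>
#include <stdio.h>

#define N        8
#define NTHREADS 2

/* Global data: shared by every thread of the process. */
static int  shared_data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
static long partial_sum[NTHREADS];          /* one slot per thread */

static void *worker(void *arg)
{
    long id    = (long)arg;                 /* local data, private to this thread */
    long sum   = 0;
    int  chunk = N / NTHREADS;

    for (int i = (int)id * chunk; i < (int)(id + 1) * chunk; i++)
        sum += shared_data[i];              /* read the shared global array */

    partial_sum[id] = sum;                  /* each thread writes a distinct slot */
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    printf("total = %ld\n", partial_sum[0] + partial_sum[1]);  /* 36 */
    return 0;
}
```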

What is the message passing programming model?

It is a parallel programming model based on message exchange, typified by MPI (Message Passing Interface): tasks exchange data by sending and receiving messages. The transmission of messages, or more properly data, requires a certain coordination between operations: each send operation must have a matching receive operation, and vice versa. Programmers who want to use this model rely on libraries, available for various languages, that implement the MPI standard and provide procedures for send and receive operations. One MPI implementation is Open MPI, available for languages such as C and Fortran.
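
A minimal point-to-point MPI sketch in C, assuming at least two processes: rank 0 sends an integer to rank 1, which receives it. This is only an illustration of the send/receive pairing described above, not a complete program from the article.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Every send must be matched by a receive on the other side. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

With Open MPI this would typically be built with mpicc and launched with mpirun -np 2.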

What is the data parallel programming model?

The parallel programming model called Data Parallel is sometimes also referred to as PGAS (Partitioned Global Address Space). The work on a shared data structure, usually an array or a cube, is split between various tasks. Each task works on a different part of the structure, performing the same operation on its own partition. On shared memory architectures, all tasks can access the data structure through global memory. On distributed memory architectures, the data structure is split into blocks that reside in the local memory of each task. Popular implementations include Fortran 90 and 95 and High Performance Fortran.
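
A small data-parallel sketch in C using OpenMP (my choice of tool, not one prescribed by the article): every iteration applies the same operation to a different element of a shared array, and the runtime partitions the iterations among threads.

```c
#include <stdio.h>

#define N 16

int main(void)
{
    double a[N];

    /* Same operation applied to every element; iterations are split among threads
       when compiled with OpenMP support (e.g., gcc -fopenmp). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;          /* each task works on its own partition of the array */

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```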

What is the goal of automatically parallelizing programs, and how do fully automatic compilers differ from programmer-directed compilers?

It is possible to use tools for the automatic parallelization of programs. These tools assist programmers by converting serial programs into parallel programs. The most common tools are parallelizing compilers and pre-processors. These can be divided into fully automatic compilers and programmer-directed compilers:

Fully Automatic: The code is analyzed during compilation and portions of parallelizable code are identified. Loops, such as while and for loops, are the best candidates for this purpose.

Programmer Directed: When writing code, the programmer uses compiler directives or compilation flags to explicitly tell the compiler how to parallelize the code. This approach can also be used in conjunction with automatic parallelization mechanisms.
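
As a sketch of the programmer-directed approach (my example, not one from the article), the loop below is annotated with an OpenMP directive and compiled with the corresponding flag (for GCC, -fopenmp); the directive tells the compiler exactly how to parallelize the loop and how to combine the partial results.

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    /* The directive below is the programmer explicitly telling the compiler
       to parallelize this loop and to reduce the partial sums into 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f\n", sum);   /* expected: 1000000.0 */
    return 0;
}
```

By contrast, a fully automatic approach leaves the source untouched and relies on a compiler option to discover the parallelism itself (GCC's -ftree-parallelize-loops is one such option).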

What is the goal of the partitioning technique?

The goal of partitioning is to divide the problem into subproblems. This division can be done in different ways and depends strictly on the type of problem being faced. Dividing the problem into smaller problems allows each subproblem to be solved in parallel. In other words, the first task in designing a parallel program is to break the problem into discrete portions, or blocks, of work that can be distributed to multiple tasks simultaneously. There are two main partitioning techniques, called domain decomposition and functional decomposition.

What is the domain decomposition technique?

The domain of the problem, that is, the data that represent and belong to the problem, is divided, and each parallel task works on a smaller problem and therefore on a different portion of the data.
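
A common way to express domain decomposition, sketched here in C with an invented helper name (block_range), is to compute the block of indices each task owns from its id and the total number of tasks:

```c
#include <stdio.h>

/* Compute the half-open range [lo, hi) of elements owned by task 'id'
   out of 'ntasks' tasks, for a domain of 'n' elements. Remainder elements
   are spread over the first tasks so block sizes differ by at most one. */
static void block_range(int id, int ntasks, int n, int *lo, int *hi)
{
    int base = n / ntasks;
    int rem  = n % ntasks;

    *lo = id * base + (id < rem ? id : rem);
    *hi = *lo + base + (id < rem ? 1 : 0);
}

int main(void)
{
    int lo, hi;

    for (int id = 0; id < 3; id++) {
        block_range(id, 3, 10, &lo, &hi);
        printf("task %d owns elements [%d, %d)\n", id, lo, hi);
    }
    return 0;
}
```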

What is the functional decomposition technique?

In functional decomposition, instead of decomposing the data that represent the problem, we try to decompose the computation itself. The problem is divided not according to which portion of the data each task has to process but according to which type of computation each task should perform. The key point of the decomposition is therefore the computation to be performed, rather than the data manipulated by the computation. Following this approach, the problem is broken down according to the work that needs to be done, and each task performs only a specific part of the overall work.
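
A tiny functional-decomposition sketch with POSIX threads (the names find_min and find_max are invented): the two threads run different computations on the same data, rather than the same computation on different slices.

```c
#include <pthread.h>
#include <stdio.h>

#define N 6
static const int data[N] = {7, 2, 9, 4, 1, 8};
static int min_val, max_val;

/* Thread 1: one kind of computation (find the minimum). */
static void *find_min(void *arg)
{
    (void)arg;
    min_val = data[0];
    for (int i = 1; i < N; i++)
        if (data[i] < min_val) min_val = data[i];
    return NULL;
}

/* Thread 2: a different kind of computation (find the maximum). */
static void *find_max(void *arg)
{
    (void)arg;
    max_val = data[0];
    for (int i = 1; i < N; i++)
        if (data[i] > max_val) max_val = data[i];
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, find_min, NULL);
    pthread_create(&t2, NULL, find_max, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("min = %d, max = %d\n", min_val, max_val);  /* min = 1, max = 9 */
    return 0;
}
```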

What are the load balancing issues?

Load balancing tries to distribute the same workload to each task, so as to avoid some tasks having much more computational work than others. In other words, load balancing tries to minimize the work imbalance between tasks, making sure that every task is kept busy while idle time is minimized. It is particularly important in parallel systems and programs because it can significantly improve performance.

Good load balancing is achieved by distributing the work fairly among the tasks. With a set of homogeneous machines it is easier to balance the load, since every machine has the same computational capability; this is not guaranteed with heterogeneous machines, where one machine may finish before another given the same workload. In these cases it is important to give heavier loads to faster machines and lighter loads to slower ones, since the overall computation time is determined by the slowest machine.

Clearly, good load balancing depends above all on how the work is partitioned, both in the case of domain decomposition and in that of functional decomposition. In the first case, the domain may be divided unfairly, so that some tasks end up with a heavier computation than others; in the second, the functional decomposition may produce inhomogeneous tasks. For domain decomposition in particular, some problems can lead to load imbalance even when the data is divided equally between tasks. This can happen, for example, with sparse matrices, where some tasks have a lot of work to do while others have almost none because of the many zeros in the portion of the matrix assigned to them.

To avoid imbalance problems like these, instead of static algorithms that assign the load only at the beginning of the computation, it is better to use algorithms that detect and dynamically manage load imbalances as they occur during processing. Often this is done with a scheduler associated with a task pool: as soon as a worker finishes its piece of work, it takes another from the queue. Reusing the same workers in the pool also avoids repeatedly creating and terminating tasks, an operation that can lead to considerable overhead.
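
A dynamic scheduling sketch along the lines of the task-pool idea above, using POSIX threads and invented names (next_item, process_item): workers repeatedly grab the next unclaimed item from a shared counter, so faster workers naturally take on more items.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NITEMS   12
#define NWORKERS 3

static int next_item = 0;                     /* shared index into the work pool */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for a unit of work whose cost varies from item to item. */
static void process_item(int item)
{
    usleep(1000 * (item % 4));                /* simulate uneven workloads */
}

static void *worker(void *arg)
{
    long id = (long)arg;

    for (;;) {
        pthread_mutex_lock(&pool_lock);
        int item = next_item < NITEMS ? next_item++ : -1;   /* claim the next item */
        pthread_mutex_unlock(&pool_lock);

        if (item < 0)                         /* pool empty: this worker is done */
            return NULL;

        process_item(item);
        printf("worker %ld finished item %d\n", id, item);
    }
}

int main(void)
{
    pthread_t workers[NWORKERS];

    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```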

What are the factors that influence communication?

The degree of communication between tasks depends strictly on the type of problem. Some problems, called embarrassingly parallel, require little or no inter-task communication, such as image processing problems where the processing done on each pixel, or subset of pixels, is independent of the others. Although problems of this kind exist, there are many others that can be parallelized but that also require a high degree of inter-task communication. Communication between tasks involves evaluating various factors:

Communication cost: Machine cycles are used to pack and send data rather than to do useful work, leading to considerable overhead when the degree of communication is high.

Latency and bandwidth costs: Latency costs are caused by the time it takes to send a message from one node to another. Bandwidth indicates the amount of data that can be transmitted per unit of time, for example in MB/s. Bandwidth costs depend on the number of messages exchanged at a given instant and/or on the amount of data to be exchanged, so it is important to monitor these parameters to ensure that the available bandwidth is not saturated.

Synchronous and asynchronous communication: With synchronous communication there is a sort of handshake between sender and receiver; synchronous communication is often blocking, meaning that sender and receiver must wait until the communication is finished. This can degrade performance, since much time may be spent waiting for a message to complete instead of doing useful work. Asynchronous communication requires no handshake between sender and receiver, and in this case we speak of non-blocking communication. Between the beginning and the end of a communication both sender and receiver can perform useful work, sometimes leading to better performance than blocking communication (a sketch of non-blocking communication with MPI appears after this list).

Type of communication: It is important, in the design phase, to know how and which tasks must communicate with each other. There are mainly two types of communication, point-to-point and collective, and both can be implemented with synchronous or asynchronous communication. In point-to-point communication two tasks communicate with each other, one as the sender and the other as the receiver. In collective communication more than two tasks, often identified by a group, communicate with each other, such that either there is a single receiver and the others are all senders (gather communication), or there is a single sender and the others are all receivers (scatter or broadcast communication), or some combination of these.
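
Here is the promised sketch of non-blocking (asynchronous) point-to-point communication with MPI, assuming exactly two processes: each rank posts its receive and send without waiting, can do useful work, and only then waits for the communication to complete. The variable names are invented.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, other, sendval, recvval = 0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other   = 1 - rank;          /* assume exactly two ranks: 0 and 1 */
    sendval = rank + 100;

    /* Post non-blocking receive and send: neither call waits for completion. */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful work can overlap with the communication here ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* now wait for both to finish */
    printf("rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}
```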

What are the broadcast, scatter, gather, and reduction communication primitives, and why are they used?

Broadcast: A single process sends a copy of the same data to all the processes in its group. So we have a single sender and a set of receivers.

Scatter: A special case of broadcast. A single process sends a portion of its data to every process in its group. Unlike broadcast, where the process sends the same piece of data to everyone else, here it sends a different portion of the data to each of the others, so that at the end of the communication each receiving process holds only a specific part of the data initially held by the single sending process. So in this case too we have a single sender and a set of receivers.

Gather: The reverse of scatter. A set of processes send their data to a single process, so that the receiving process ends up with a single data structure made up of the pieces of data sent to it by the sending processes. So we have a single receiver and a set of senders.

Reduction: A special case of gather. In this case the data is not only received but also aggregated: some operation is performed on it upon receipt, for example a sum, average, minimum, or maximum over all the data sent to the receiving process. In other words, it is a collective operation in which a single process, the receiver, sometimes called the root, collects data from the other processes in the group, the senders, and applies an operation to the data, producing a single value.
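
A compact MPI sketch of the collective primitives just described, with invented variable names: the root broadcasts a parameter, scatters one chunk of an array to each rank, and a reduction sums the partial results back at the root.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, factor = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *senddata = NULL;
    if (rank == 0) {
        factor   = 2;                             /* data known only to the root */
        senddata = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++)
            senddata[i] = i + 1;
    }

    /* Broadcast: the root sends the same value to every process. */
    MPI_Bcast(&factor, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Scatter: each process receives a different element of the root's array. */
    int mychunk = 0;
    MPI_Scatter(senddata, 1, MPI_INT, &mychunk, 1, MPI_INT, 0, MPI_COMM_WORLD);

    int partial = mychunk * factor;               /* local computation */

    /* Reduction: the partial results are summed into a single value at the root. */
    int total = 0;
    MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("total = %d\n", total);            /* 2 * (1 + 2 + ... + size) */
        free(senddata);
    }

    MPI_Finalize();
    return 0;
}
```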

Why does data dependence influence the parallelization of a program?

Data dependencies influence parallelization because they can inhibit parallelism. For example, in image processing it is often possible to parallelize the problem because each pixel, or subset of pixels, is independent of the others. In other problems the data are not independent of each other, and the problem becomes difficult to parallelize.
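
A small C illustration of the point: the first loop carries a dependence between iterations and cannot safely be run in parallel as written, while the second has independent iterations and parallelizes trivially.

```c
#define N 1000
double a[N], b[N];

void dependent(void)
{
    /* Loop-carried dependence: iteration i needs the result of iteration i-1,
       so the iterations cannot be executed in parallel as written. */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];
}

void independent(void)
{
    /* No dependence between iterations: each element can be computed
       by a different task, so the loop parallelizes easily. */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
}
```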

NB: For further information, I recommend the following link:

Introduction to Parallel Computing
