Parallel Computing and OpenMP

Parallel computing is not new, but its applications are enormous. One recent area that takes advantage of parallel computing is neural networks for computer vision and other deep learning problems. The purpose of this article is to give some insight into parallel computing and a basic programming introduction to the OpenMP (Open Multi-Processing) framework. See https://www.openmp.org/

Every processor already performs some operations in parallel at its core. Every instruction it executes goes through different stages: Instruction Fetch, Decode, Operand Fetch and Execute. These stages can be overlapped across instructions to get better performance. For example, while the second instruction is being decoded, the processor can fetch the next instruction in the sequence.

Fig 1 — Instruction Pipeline

In the above figure, at the highlighted (yellow) stage of the pipeline, four different stages of different instructions are being processed in parallel. Though the instruction pipeline is a performance-boosting technique, it suffers from problems such as data dependency.

For example, consider the machine language instructions below:

add R1, 1000 # Add value 1000 to the contents of R1 register.

mul R1, 300 # Multiply the contents of R1 register by 300.

add R2, 200 # Add value 200 to the contents of R2 register.

The first two instructions access the same register, R1, so they cannot be parallelized. One solution is for the processor to look ahead beyond the next instruction in the sequence and check whether a later instruction can be executed in parallel instead. In the above sequence, the third instruction can be executed in parallel with the first one, since it operates on a different register, R2. In this way, the processor maintains a window of instructions that it scans to identify instructions that can execute in parallel. This is out-of-order execution, as found in superscalar processors.

Implementing this at the hardware level is complex, and an alternative is VLIW, Very Long Instruction Word architectures (https://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2833).

VLIW moves the complexity from the hardware to the compiler. While compiling the source code, the compiler looks for opportunities to parallelize instructions so that they can be dispatched to multiple parallel functional units in hardware.

The following concepts need to be known before writing parallel programs:

Cache Coherency: An important concept to consider while writing parallel programs. Changes made by processors in their respective caches need to be synced back to the original memory location.

Cache Size: Cache is memory that sits close to the processor. The larger the cache, the better the performance. Since cache memory is limited, it is the programmer's job to make the best use of it.

Cache Line Size: The size of each entry in the cache, i.e., the unit in which data is transferred between main memory and the cache.

Temporal Locality: When data that has been fetched into the cache is used again and again, that is called temporal locality.

Spatial Locality: When the cache fetches data from a memory location, it also brings the following consecutive memory locations into the cache. The programmer can take advantage of this by organizing data in memory according to how the program accesses it, for example by storing all the values of a vector contiguously in memory when computing dot products of vectors, as in the sketch below.
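Here is a small illustrative sketch (the array sizes and values are arbitrary, my own example): the dot product walks both arrays sequentially, so every element of a fetched cache line gets used.

#include <stdio.h>

#define N 1024

/* Dot product over contiguous arrays: consecutive elements share
   cache lines, so the sequential access pattern exploits spatial locality. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

int main(void)
{
    double a[N], b[N];
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }
    printf("dot = %f\n", dot(a, b, N));
    return 0;
}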

OpenMP — Open Multi Processing

OpenMP is a framework that enables writing parallel programs. The programmer directs the compiler to insert the code needed to make the program parallel. This is done through the #pragma directives provided by the framework.

One can use any compiler that supports OpenMP. Here, I am using a Cygwin installation on a Windows machine. Include the OpenMP header in the program and compile with the -fopenmp flag:

#include <omp.h>

gcc -fopenmp <program_name>.c

The environment variable OMP_NUM_THREADS can be used to set the number of threads that OpenMP should launch.

The command below sets the number of threads to 4:

$ export OMP_NUM_THREADS=4

First OpenMP program:

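A minimal sketch of such a program (the printed message is only illustrative) could look like this:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Fork: OMP_NUM_THREADS threads execute this block. */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    /* Join: all threads synchronize at the end of the parallel block. */
    return 0;
}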

Look at the above code. At the #pragma omp parallel statement, the framework inserts the code needed to launch OMP_NUM_THREADS threads. OpenMP uses a fork-join model to launch the threads. The end of the parallel block is a synchronization point where all the threads join.

OpenMP Fork-Join Model demonstration
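A sketch demonstrating the fork-join behaviour (again, the messages are only illustrative) could be:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Before the parallel region there is only one thread. */
    printf("Before: %d thread(s)\n", omp_get_num_threads());

    #pragma omp parallel
    {
        printf("Inside: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    /* After the join, execution continues with a single thread. */
    printf("After: %d thread(s)\n", omp_get_num_threads());
    return 0;
}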

See the above code. It shows that there is only one thread before the #pragma omp parallel statement, and execution returns to a single thread after the parallel block ends.

OpenMP — Parallel execution of FOR loop
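A sketch of the kind of loop being discussed, assuming a simple shared counter named sum, might look like this:

#include <stdio.h>
#include <omp.h>

#define N 10000

int main(void)
{
    int sum = 0;

    /* The 10000 iterations are split into chunks of 100 iterations
       and distributed across the threads. The update of the shared
       variable sum is not protected, so this version has a race. */
    #pragma omp parallel for schedule(static, 100)
    for (int i = 0; i < N; i++)
        sum = sum + 1;

    printf("sum = %d (expected %d)\n", sum, N);
    return 0;
}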

In the above code, I introduced another pragma which parallelizes the for loop iterations across the threads. The schedule policy can be static or dynamic. In the dynamic case, remaining chunks of iterations are assigned at run time to whichever threads are free. The second parameter (in this case 100) is the chunk size. The code has 10000 iterations, and with a chunk size of 100 they are divided into 100 chunks (10000 / 100) of 100 iterations each, which are distributed across the threads.

If you execute the above code, you will almost never see the right answer, in this case 10000. Why? All the threads update the shared variable sum concurrently, and the read-modify-write of sum is not atomic, so updates from different threads get lost. This is a race condition on shared data.

Critical Section for global variables
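A sketch of the same loop with the update guarded by a critical section (still using the illustrative sum counter) could be:

#include <stdio.h>
#include <omp.h>

#define N 10000

int main(void)
{
    int sum = 0;

    #pragma omp parallel for schedule(static, 100)
    for (int i = 0; i < N; i++)
    {
        /* Only one thread at a time may execute this statement,
           so the shared variable is updated safely, but slowly. */
        #pragma omp critical
        sum = sum + 1;
    }

    printf("sum = %d\n", sum);
    return 0;
}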

Here, I changed the code to guard access to the global variable by marking the code that modifies it as a critical section (while one thread is executing the code in the critical section, no other thread is allowed to execute it). The result is now correct, but look at the execution time: it is high, and we will improve it further.

Partial Computations

Unless specified otherwise, the OpenMP framework treats every variable as shared across threads. It is the programmer's responsibility to tell the framework when a variable should be private to the threads; in that case, each thread gets its own copy of the variable on its stack.

Here is the modified code with a thread-specific local variable that holds the partial sum calculated by each thread. The code that adds the partial sum back to the global shared variable (in this case the ‘sum’ variable) still has to avoid a race condition across threads, so I made that part a critical section.
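A sketch of this version, keeping the same illustrative counter, might be:

#include <stdio.h>
#include <omp.h>

#define N 10000

int main(void)
{
    int sum = 0;

    #pragma omp parallel
    {
        int partial = 0;   /* private to each thread, lives on its stack */

        #pragma omp for schedule(static, 100)
        for (int i = 0; i < N; i++)
            partial = partial + 1;

        /* Only the final add-back touches the shared variable,
           so the critical section is entered once per thread. */
        #pragma omp critical
        sum = sum + partial;
    }

    printf("sum = %d\n", sum);
    return 0;
}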

OpenMP reduction clause

I improved the code further by using another OpenMP feature, the reduction clause. It tells the framework to give each thread its own private copy of the variable and to combine (reduce) the private copies into the shared variable at the end of the parallel region, so no explicit critical section is needed. A sketch of such a version is shown below.
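Again using the same illustrative counter:

#include <stdio.h>
#include <omp.h>

#define N 10000

int main(void)
{
    int sum = 0;

    /* reduction(+:sum) gives each thread a private copy of sum,
       initialised to 0, and adds the copies into the shared sum
       at the end of the parallel region. */
    #pragma omp parallel for schedule(static, 100) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum = sum + 1;

    printf("sum = %d\n", sum);
    return 0;
}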

Summary:

I have tried to give a basic introduction to parallel computing and OpenMP. Here are more references on the OpenMP framework:

https://www.openmp.org/resources/tutorials-articles/

Thanks for reading! Leave a response if you have more experience with parallel computing or OpenMP.
