Parallel Computing using OpenMP

API specification for parallel programming #CSeries — Episode #08


OpenMP (Open Multi-Processing) is an API that provides a simple and flexible interface for parallel programming in C, C++, and Fortran.

It allows developers to write programs that can run efficiently on multi-core and multi-threaded systems by using compiler directives, libraries, and environment variables.

The primary purpose of OpenMP is to enable the creation of parallel applications to improve performance and reduce execution time on shared-memory architectures.

Let’s start by running some C code in VSCode and discussing the key characteristics that matter most:

0 Hello World!

// openMP_0.c

#include <omp.h>
#include <stdio.h>

int main()
{
    printf("INI.\n");

    #pragma omp parallel
    {
        printf("I have a passion for parallelism!\n");
    }

    printf("END.\n");
    return 0;
}
gcc -fopenmp openMP_0.c -o openMP_0 && "./openMP_0"

INI.
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
I have a passion for parallelism!
END.

The program uses OpenMP to print a message concurrently from multiple threads, showcasing parallel processing. The INI. and END. messages are printed once, while the parallel message is printed by each thread. The exact number of lines depends on the number of threads OpenMP is using.

Here, the message is printed 12 times, indicating that 12 threads were spawned. Each thread executes the printf statement independently.

I’m running this on a notebook whose CPU exposes 12 logical processors (the i7-9750H has 6 cores with Hyper-Threading), so OpenMP defaults to a team of 12 threads, each executing the printf statement independently. Incredible!
My System:

OS: Ubuntu 24.04 LTS
Processor: Intel® Core™ i7–9750H × 12 — 8.0 GiB RAM

Key Points

  • Parallel Execution: The #pragma omp parallel directive creates a parallel region. Within this region, the enclosed code is executed by multiple threads concurrently.
  • Output Ordering: The order of the output lines from the parallel region (I have a passion for parallelism!) is not guaranteed because they are executed by different threads. The output can vary between runs depending on thread scheduling (the sketch after these key points shows how to query the team size).
  • Start and End Messages: The INI. and END. messages are outside the parallel region, so they are executed once by the main thread.
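The number of threads above came entirely from OpenMP’s defaults. Here is a minimal sketch of my own (the file name openMP_0b.c is hypothetical, not part of the series’ repo) that queries those defaults and reports the team size from inside a parallel region:

// openMP_0b.c (hypothetical companion example)

#include <omp.h>
#include <stdio.h>

int main()
{
    // omp_get_max_threads() reports how many threads the next parallel
    // region would use by default (typically the number of logical CPUs,
    // or the value of the OMP_NUM_THREADS environment variable).
    printf("Default thread count: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        // Print the team size once, from a single thread.
        #pragma omp single
        printf("Team size inside the region: %d\n", omp_get_num_threads());
    }
    return 0;
}

Running it as OMP_NUM_THREADS=4 ./openMP_0b should report a team of 4 instead of 12.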

1 Spawning a Chosen Number of Threads:

// openMP_1.c

#include <omp.h>
#include <stdio.h>

int main()
{
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        printf("I am thread number %d.\n", id);
    }
    return 0;
}

This code demonstrates a simple use of OpenMP to run a parallel block of code with multiple threads. Here’s a brief explanation:

  1. #pragma omp parallel num_threads(4): This directive tells OpenMP to create a parallel region where four threads are spawned. Each thread will execute the code within the braces {} independently.
  2. int id = omp_get_thread_num();: This line retrieves the unique ID of each thread. omp_get_thread_num() returns an integer representing the thread number, ranging from 0 to num_threads-1. Since num_threads is set to 4, possible thread IDs are 0, 1, 2, and 3.
  3. printf("I am thread number %d.\n", id);: Each thread prints its unique ID. Since this is within a parallel region, all threads execute this line simultaneously but independently, resulting in multiple print statements.

gcc -fopenmp openMP_1.c -o openMP_1 && "./openMP_1"

I am thread number 0.
I am thread number 2.
I am thread number 1.
I am thread number 3.

The output will consist of four lines, each indicating the ID of a thread, but the order may vary due to the concurrent execution of threads.
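The num_threads clause is not the only way to choose the team size. Here is a small variation of my own on openMP_1.c (assumed, not from the original repo) that uses the runtime call omp_set_num_threads() instead:

#include <omp.h>
#include <stdio.h>

int main()
{
    // Applies to later parallel regions that do not override it
    // with their own num_threads clause.
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("I am thread number %d.\n", id);
    }
    return 0;
}

Setting the OMP_NUM_THREADS environment variable is a third option that requires no code change at all.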

2 Explicit Control Over Thread Work Distribution:

// openMP_2.c

#include <stdio.h>
#include <omp.h>
#define TOTAL 2048

int main()
{
    int A[TOTAL];

    #pragma omp parallel
    {
        int id  = omp_get_thread_num();  // this thread's ID
        int nt  = omp_get_num_threads(); // total number of threads
        int tam = TOTAL / nt;            // chunk size per thread
        int ini = id * tam;              // first index of this chunk
        int fim = ini + tam;             // one past the last index

        for (int i = ini; i < fim; i++)
        {
            A[i] = i * i;
            printf("Th[%d] : %02d = %03d\n", id, i, A[i]);
        }
    }
    return 0;
}

This version provides more explicit control over thread work distribution, but it requires manual computation of indices and ranges.

This program demonstrates a simple use of OpenMP to parallelize a computational task, where each thread is responsible for computing a portion of an array’s elements independently. It showcases how OpenMP can be used to efficiently divide work among threads and perform parallel computations on shared data structures like arrays.

Key Points:

  • Parallel Region: This code uses #pragma omp parallel to create a parallel region. Each thread executes the code within this parallel region.
  • Thread Identification: Inside the parallel region, threads obtain their ID with omp_get_thread_num() and the total number of threads with omp_get_num_threads().
  • Work Division: The code manually calculates the start (ini) and end (fim) indices for each thread based on its ID and the total number of threads. Each thread processes a chunk of the array (A), ensuring no overlap between threads.
  • Print Statements: Each thread prints its results, showing the thread ID and the index it processed.

Advantages:

  • Explicit Work Division: This approach provides control over how the workload is divided among threads.

Disadvantages:

  • Manual Index Calculation: You need to calculate each thread’s range by hand, which is error-prone and less flexible, especially when the array size is not evenly divisible by the number of threads; a safer index calculation is sketched below.
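One way to make the manual division less error-prone is to use an exclusive upper bound and spread any remainder over the first few threads. A sketch of mine (same TOTAL and array as code 2; the chunk arithmetic is my own, not the article’s):

#include <omp.h>
#include <stdio.h>
#define TOTAL 2048

int main()
{
    static int A[TOTAL];

    #pragma omp parallel
    {
        int id    = omp_get_thread_num();
        int nt    = omp_get_num_threads();
        int base  = TOTAL / nt;                        // minimum chunk size
        int resto = TOTAL % nt;                        // leftover elements
        int ini   = id * base + (id < resto ? id : resto);
        int fim   = ini + base + (id < resto ? 1 : 0); // exclusive end

        for (int i = ini; i < fim; i++)
            A[i] = i * i;
    }

    // Every element is filled, even when TOTAL is not a multiple of nt.
    printf("A[%d] = %d\n", TOTAL - 1, A[TOTAL - 1]);
    return 0;
}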

3 Automatic Work Distribution with Parallel For:

// openMP_3.c

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define TOTAL 2048
int main()
{
    int A[TOTAL];

    // OpenMP splits the loop iterations among the threads automatically
    #pragma omp parallel for
    for (int i = 0; i < TOTAL; ++i)
    {
        A[i] = i * i;
        printf("Th[%d]: %02d=%03d\n", omp_get_thread_num(), i, A[i]);
    }
    return 0;
}

This version uses parallel for for automatic work distribution, making the code simpler and easier to manage, though with less fine-grained control over how iterations are assigned to threads.

Key Points:

  • Parallel For Loop: This code uses #pragma omp parallel for, which is a shorthand for parallelizing a for loop. OpenMP handles the division of work among threads automatically.
  • Automatic Work Division: The parallel for construct automatically divides the iterations of the loop among the available threads, so you don't need to manually calculate indices or ranges.
  • Print Statements: Each thread prints its results, similar to the previous example.

Advantages:

  • Simplified Code: Using parallel for simplifies the code by automatically managing work distribution, reducing the likelihood of errors.
  • Ease of Use: This approach is more straightforward and less error-prone for parallelizing loops.

Disadvantages:

  • Less Control: While it simplifies the code, you have less control over how the work is divided among threads compared to manual division; the schedule clause, sketched below, restores some of that control.
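If you want some of that control back without returning to manual index math, the schedule clause is the usual middle ground. A sketch of mine, assuming the same loop as code 3:

#include <omp.h>
#include <stdio.h>
#define TOTAL 2048

int main()
{
    static int A[TOTAL];

    // schedule(static, 256) hands each thread fixed blocks of 256 iterations;
    // schedule(dynamic, 64) would instead let idle threads grab the next block,
    // which helps when iterations take uneven amounts of time.
    #pragma omp parallel for schedule(static, 256)
    for (int i = 0; i < TOTAL; ++i)
        A[i] = i * i;

    printf("A[%d] = %d\n", TOTAL - 1, A[TOTAL - 1]);
    return 0;
}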

4 The Importance of the Barrier Directive for Thread Synchronization:

// openMP_4.c

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define SIZE 1000

double A[SIZE], B[SIZE];

double funcao_complexa_1(int id)
{
    return id * 2.0;
}

double funcao_complexa_2(double *arr, int id)
{
    return arr[id] + 2.0;
}

int main()
{
    double start, end;
    double cpu_time_used;

    // Start timing
    start = omp_get_wtime();

    // One parallel region containing two work-sharing loops
    #pragma omp parallel
    {
        // First loop: compute A[i]
        #pragma omp for
        for (int i = 0; i < SIZE; ++i)
        {
            A[i] = funcao_complexa_1(i);
        }

        // Synchronize threads to ensure A is fully computed
        #pragma omp barrier

        // Second loop: compute B[i] from A
        #pragma omp for
        for (int i = 0; i < SIZE; ++i)
        {
            B[i] = funcao_complexa_2(A, i);
        }
    }

    // End timing
    end = omp_get_wtime();
    cpu_time_used = end - start;

    // Print the results for verification
    for (int i = 0; i < SIZE; i++)
    {
        printf("A[%d] = %f, B[%d] = %f\n", i, A[i], i, B[i]);
    }

    // Print the execution time
    printf("OpenMP Execution Time: %f seconds\n", cpu_time_used);

    return 0;
}

This code demonstrates the use of OpenMP for parallel processing in C. It performs the following tasks:

  1. Initialization: Sets up arrays A and B, and defines two complex functions for computation.
  2. Timing: Measures the execution time using omp_get_wtime().
  3. Parallel Execution: Uses OpenMP to parallelize the computation in two stages: first, filling array A with values, then computing array B based on A.
  4. Synchronization: Ensures that all threads complete computing A before starting the computation for B using #pragma omp barrier.
  5. Output: Prints the computed arrays and the execution time.

The code highlights the efficiency gains from parallelizing tasks and timing the overall execution.
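One detail worth noting: a work-sharing #pragma omp for already ends with an implicit barrier, so the explicit barrier in code 4 mostly restates what OpenMP guarantees anyway. It becomes essential when that implicit barrier is removed with nowait, as in this sketch of mine (the inline arithmetic stands in for the funcao_complexa calls):

#include <omp.h>
#include <stdio.h>
#define SIZE 1000

double A[SIZE], B[SIZE];

int main()
{
    #pragma omp parallel
    {
        #pragma omp for nowait // no implicit barrier at the end of this loop
        for (int i = 0; i < SIZE; ++i)
            A[i] = i * 2.0;

        #pragma omp barrier // required: B[i] may read an A[i] written by another thread

        #pragma omp for
        for (int i = 0; i < SIZE; ++i)
            B[i] = A[i] + 2.0;
    }

    printf("B[%d] = %f\n", SIZE - 1, B[SIZE - 1]);
    return 0;
}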

5 Sequential Computation in Non-OpenMP Code:

// openMP_5.c

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 1000

double A[SIZE], B[SIZE];

double funcao_complexa_1(int id)
{
    return id * 2.0;
}

double funcao_complexa_2(double *arr, int id)
{
    return arr[id] + 2.0;
}

int main()
{
    clock_t start, end;
    double cpu_time_used;

    // Start timing
    start = clock();

    // Compute A[i]
    for (int i = 0; i < SIZE; ++i)
    {
        A[i] = funcao_complexa_1(i);
    }

    // Compute B[i]
    for (int i = 0; i < SIZE; ++i)
    {
        B[i] = funcao_complexa_2(A, i);
    }

    // End timing
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;

    // Print the results for verification
    for (int i = 0; i < SIZE; i++)
    {
        // printf("A[%d] = %f, B[%d] = %f\n", i, A[i], i, B[i]);
    }

    // Print the execution time
    printf("Non-OpenMP Execution Time: %f seconds\n", cpu_time_used);

    return 0;
}

The purpose of this code is to provide a baseline for comparing execution times with the previous OpenMP code (4). For sufficiently large workloads, the OpenMP code runs more quickly than the non-OpenMP code for the following reasons:

  • Parallelism: The OpenMP code uses multiple threads to parallelize the computations. This allows it to leverage multiple CPU cores, which can significantly reduce the execution time, especially for large datasets or complex calculations.
  • Synchronization: The OpenMP code includes a barrier directive to synchronize threads, ensuring all threads complete their work on array A before any thread proceeds to array B. The barrier itself adds a small cost, but it is what keeps the parallel version correct.

In contrast, the non-OpenMP code runs sequentially on a single core, making it slower, especially as the size of the data or complexity of calculations increases. Therefore, the OpenMP code is expected to run more quickly compared to the non-OpenMP version.

But let’s compare these two simple programs:

7 Non-OpenMP Code:

// openMP_7.c

#include <stdio.h>
#include <time.h>
#define TOTAL 2048 // 100000

int main()
{
    int A[TOTAL];
    clock_t start, end;
    double cpu_time_used;

    // Start timing
    start = clock();

    for (int i = 0; i < TOTAL; ++i)
    {
        A[i] = i * i;
        // Comment out printf to reduce I/O overhead
        // printf("%02d=%03d\n", i, A[i]);
    }

    // End timing
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;

    // Print the execution time
    printf("Non-OpenMP Execution Time: %f seconds\n", cpu_time_used);

    return 0;
}

8 OpenMP Code:

// openMP_8.c

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define TOTAL 2048 // 100000

int main()
{
    int A[TOTAL];
    double start, end;
    double cpu_time_used;

    // Set the number of threads
    omp_set_num_threads(4);

    // Start timing
    start = omp_get_wtime();

    #pragma omp parallel for
    for (int i = 0; i < TOTAL; ++i)
    {
        A[i] = i * i;
        // Comment out printf to reduce I/O overhead
        // printf("Th[%d]: %02d=%03d\n", omp_get_thread_num(), i, A[i]);
    }

    // End timing
    end = omp_get_wtime();
    cpu_time_used = end - start;

    // Print the execution time
    printf("OpenMP Execution Time: %f seconds\n", cpu_time_used);

    return 0;
}

Running codes 7 and 8 above with an array size of 2048 yields the following results:

gcc -fopenmp openMP_7.c -o openMP_7 && "./openMP_7"
Non-OpenMP Execution Time: 0.000005 seconds
...
gcc -fopenmp openMP_8.c -o openMP_8 && "./openMP_8"
OpenMP Execution Time: 0.000141 seconds

Initially, the non-OpenMP code performed better with smaller arrays. However, the advantages of using OpenMP become clear when handling larger arrays; on this machine the crossover appears somewhere between 100,000 and 200,000 elements. In such cases, parallelism with OpenMP improves performance by spreading the increased workload across multiple CPU cores.

Now, running codes 7 and 8 above with an array size of 100,000 yields the following results:

Time7:
gcc -fopenmp openMP_7.c -o openMP_7 && "./openMP_7"
Non-OpenMP Execution Time: 0.000190 seconds
....
Time8:
gcc -fopenmp openMP_8.c -o openMP_8 && "./openMP_8"
OpenMP Execution Time: 0.000281 seconds

Time8 is still roughly 1.5 times slower than Time7 at this size (Time7 takes about 0.68 of Time8's time); the parallel overhead still dominates.

Now, running codes 7 and 8 with an array size of 200,000 yields the following results:

Time7:
gcc -fopenmp openMP_7.c -o openMP_7 && "./openMP_7"
Non-OpenMP Execution Time: 0.000439 seconds
...
Time8:
gcc -fopenmp openMP_8.c -o openMP_8 && "./openMP_8"
OpenMP Execution Time: 0.000281 seconds

Time8 is approximately 1.56 times faster than Time7.

Above this turning point (around 200,000 elements on this machine), the OpenMP version becomes the better choice.

Yes, this observation holds up. When measuring performance with smaller array sizes, the overhead of using OpenMP can outweigh the benefits, causing the non-OpenMP version to perform better. Here’s why, and what the advantages of using OpenMP are:

Why Can OpenMP Be Slower for Smaller Tasks?

  1. Overhead of Parallelization: OpenMP introduces overhead due to thread creation, synchronization, and management. For smaller tasks or tasks that require minimal computation (like simple loops over small arrays), this overhead can be significant relative to the time spent performing the actual computations (see the sketch after this list).
  2. Thread Synchronization: In parallel code, especially with OpenMP, threads need to synchronize to ensure data consistency and avoid race conditions. This synchronization takes time, adding to the overhead.
  3. Insufficient Task Granularity: If the task size is too small, each thread might not have enough work to justify the cost of spawning and managing threads. This can result in inefficiencies and make the parallel version slower than the sequential one.
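One practical way to avoid paying that overhead on small inputs is the if clause, which makes the parallelization conditional. A sketch of mine (the 100000 threshold is illustrative, not a measured value):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 2048;
    int *A = malloc(n * sizeof *A);
    if (A == NULL)
        return 1;

    // Spawn a thread team only when the loop is large enough to pay for it;
    // otherwise the loop runs serially on the calling thread.
    #pragma omp parallel for if(n > 100000)
    for (int i = 0; i < n; ++i)
        A[i] = i * i;

    printf("Computed %d elements (%s).\n", n, n > 100000 ? "parallel" : "serial");
    free(A);
    return 0;
}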

Advantages of Using OpenMP:

  1. Scalability for Larger Tasks: OpenMP is designed to improve the performance of applications that can benefit from parallelization, especially when dealing with large data sets or computationally intensive tasks. For example, with larger array sizes (like 1,000,000 elements), the benefits of parallel processing outweigh the overhead, leading to significant speedups.
  2. Ease of Parallelization: OpenMP allows developers to parallelize code with minimal modifications using pragmas (#pragma directives). This makes it easier to convert existing sequential code into parallel code without deep restructuring.
  3. Improved Resource Utilization: OpenMP can make better use of available CPU resources by distributing workloads across multiple cores. This is especially beneficial on multi-core and multi-processor systems, where each thread can run on a separate core, effectively utilizing all available hardware.
  4. Dynamic and Flexible: OpenMP supports dynamic scheduling, which allocates tasks to threads at run time, balancing the load and ensuring efficient utilization of resources. It also allows for controlling the number of threads, shared and private variables, and more (see the data-sharing sketch after this list).
  5. Support for Different Levels of Parallelism: OpenMP supports various parallelism forms, including task parallelism and data parallelism, making it suitable for a wide range of applications, from simple loops to more complex algorithms.
  6. Portability: OpenMP is supported by many compilers and works across different platforms, making the parallel code more portable and easier to maintain.
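On the shared-and-private-variables point, default(none) is a simple way to make every data-sharing decision explicit; the compiler then rejects any variable whose role is not declared. A minimal sketch of mine:

#include <omp.h>
#include <stdio.h>
#define TOTAL 1000

int main()
{
    static int A[TOTAL];
    int offset = 10; // read by every thread, so it is listed as shared

    // The loop index i is private automatically; everything else must be declared.
    #pragma omp parallel for default(none) shared(A, offset)
    for (int i = 0; i < TOTAL; ++i)
    {
        int tmp = i + offset; // declared inside the loop body: private by construction
        A[i] = tmp * tmp;
    }

    printf("A[0] = %d, A[%d] = %d\n", A[0], TOTAL - 1, A[TOTAL - 1]);
    return 0;
}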

When to Use OpenMP:

  • Computationally Intensive Tasks: When your application has heavy computations that can be divided among multiple threads.
  • Large Data Sets: When you are processing large arrays or matrices where each element can be computed independently.
  • Embarrassingly Parallel Problems: Problems that can be easily divided into parallel tasks with minimal communication between them (a reduction sketch follows this list).
  • Long-Running Tasks: Tasks that take a long time to execute sequentially but can be broken down into smaller parts to run concurrently.
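For the large-data-set and embarrassingly parallel cases, the reduction clause is often all that is needed to combine per-thread partial results safely, with no explicit locking. A sketch of mine (a plain array sum, not taken from the article’s repo):

#include <omp.h>
#include <stdio.h>
#define TOTAL 1000000

int main()
{
    static double A[TOTAL];
    for (int i = 0; i < TOTAL; ++i)
        A[i] = i * 0.5;

    double sum = 0.0;

    // Each thread accumulates a private partial sum; OpenMP adds the
    // partial sums together at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < TOTAL; ++i)
        sum += A[i];

    printf("Sum = %f\n", sum);
    return 0;
}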

9 Mutual Exclusion:

// openMP_9.c

#include <stdio.h>
#include <omp.h>

#define TOTAL 1000

// Example of complex function calculations
float funcao_complexa_1(int i)
{
    return (float)(i * 2.0); // Simple example function
}

float funcao_complexa_2(float B)
{
    return (float)(B + 2.0); // Simple example function
}

int main()
{
    float res = 0.0; // Shared variable to accumulate results

    #pragma omp parallel
    {
        float B; // Private variable for each thread
        int i, id, nthreads;

        id = omp_get_thread_num();        // Get the thread ID
        nthreads = omp_get_num_threads(); // Get the number of threads

        // Divide the work among threads
        for (i = id; i < TOTAL; i += nthreads)
        {
            B = funcao_complexa_1(i); // Perform a complex computation

            // Critical section: only one thread can execute this at a time
            #pragma omp critical
            {
                res += funcao_complexa_2(B); // Update the shared variable
            }
        }
    }

    // Output the result after parallel computation
    printf("Result: %f\n", res);

    return 0;
}

To understand mutual exclusion using OpenMP, we need to ensure that certain parts of the code are executed by only one thread at a time. This is achieved using the #pragma omp critical directive, which allows for mutual exclusion in a parallel region.

The code simulates a scenario where multiple threads update a shared variable (res). The #pragma omp critical section ensures that only one thread at a time can execute the critical section, preventing race conditions.
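When the protected operation is a single update such as res += value, #pragma omp atomic is a lighter-weight alternative to a full critical section. A sketch of mine that mirrors the structure of code 9 (the inline arithmetic stands in for the funcao_complexa calls):

#include <omp.h>
#include <stdio.h>
#define TOTAL 1000

int main()
{
    float res = 0.0f;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        for (int i = id; i < TOTAL; i += nthreads)
        {
            float B = (float)(i * 2.0);

            // atomic protects exactly one read-modify-write, with less
            // overhead than a general critical section.
            #pragma omp atomic
            res += (float)(B + 2.0);
        }
    }

    printf("Result: %f\n", res);
    return 0;
}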

10 Applying OpenMP Concepts to Aviation Ticketing Systems:

// openMP_10.c

#include <stdio.h>
#include <omp.h>

#define TOTAL_SEATS 1000 // Total number of seats available
#define THREADS 8 // Number of threads to simulate multiple users

// Simulate a complex function to calculate booking or cancellation
int book_ticket(int current_seat_count)
{
    // Simulate some computation
    return current_seat_count - 1;
}

int cancel_ticket(int current_seat_count)
{
    // Simulate some computation
    return current_seat_count + 1;
}

int main()
{
    int available_seats = TOTAL_SEATS; // Shared resource

    // Start parallel region
    #pragma omp parallel num_threads(THREADS)
    {
        int id = omp_get_thread_num(); // Get thread ID

        // Simulate booking or cancellation by each thread
        for (int i = 0; i < 100; ++i) // Each thread makes 100 attempts
        {
            if (id % 2 == 0) // Even ID threads try to book tickets
            {
                #pragma omp critical
                {
                    if (available_seats > 0) // Check to avoid overbooking
                    {
                        available_seats = book_ticket(available_seats);
                        printf("Thread %d booked a ticket. Seats left: %d\n", id, available_seats);
                    }
                    else
                    {
                        printf("Thread %d tried to book a ticket, but none were available.\n", id);
                    }
                }
            }
            else // Odd ID threads cancel tickets
            {
                #pragma omp critical
                {
                    if (available_seats < TOTAL_SEATS) // Check to avoid exceeding total seats
                    {
                        available_seats = cancel_ticket(available_seats);
                        printf("Thread %d canceled a ticket. Seats left: %d\n", id, available_seats);
                    }
                    else
                    {
                        printf("Thread %d tried to cancel a ticket, but none were booked.\n", id);
                    }
                }
            }
        }
    }

    // Final output after all threads have completed
    printf("Final number of available seats: %d\n", available_seats);

    return 0;
}

The last code example demonstrates mutual exclusion in a parallel processing context using OpenMP. This concept can be applied to an aviation ticketing system, where multiple threads (representing different agents or customers) might try to book or update ticket information simultaneously. In such a scenario, mutual exclusion is crucial to avoid data inconsistencies and ensure correct ticketing operations.

Execution of code 10:

gcc -fopenmp openMP_10.c -o openMP_10 && "./openMP_10"

Thread 1 tried to cancel a ticket, but none were booked.
Thread 1 tried to cancel a ticket, but none were booked.
...
Thread 0 booked a ticket. Seats left: 999
Thread 0 booked a ticket. Seats left: 998
...
Thread 7 canceled a ticket. Seats left: 983
Thread 7 canceled a ticket. Seats left: 984
...
Thread 4 booked a ticket. Seats left: 985
Thread 4 booked a ticket. Seats left: 984
...
Thread 4 booked a ticket. Seats left: 918
Thread 7 canceled a ticket. Seats left: 919
Thread 7 canceled a ticket. Seats left: 920
...
Thread 7 canceled a ticket. Seats left: 985
Thread 6 booked a ticket. Seats left: 984
...
Thread 1 canceled a ticket. Seats left: 1000
Thread 1 tried to cancel a ticket, but none were booked.
Thread 1 tried to cancel a ticket, but none were booked.
...
Final number of available seats: 806

Understanding the Output of the Code with OpenMP and Mutual Exclusion

The output represents a simulation of an aviation ticket booking system, where multiple threads concurrently attempt to book or cancel tickets. Here’s a detailed breakdown of what’s happening:

  1. Thread Behavior and Actions:
  • The code uses OpenMP to parallelize the operations, with multiple threads (8 in this example) trying to either book or cancel tickets.
  • Each thread is identified by its thread ID (id), which determines whether it will attempt to book or cancel tickets.
  2. Ticket Booking and Cancellation:
  • Booking Tickets: If a thread ID is even (e.g., 0, 2, 4, 6), it attempts to book tickets. Booking reduces the number of available seats (available_seats) by one.
  • Canceling Tickets: If a thread ID is odd (e.g., 1, 3, 5, 7), it attempts to cancel tickets. Canceling increases the number of available seats by one, but only if there are booked tickets available to cancel.
  3. Critical Sections (#pragma omp critical):
  • Critical sections ensure that only one thread can modify the shared variable available_seats at a time, preventing race conditions.
  • This is crucial to avoid scenarios where two threads try to book or cancel tickets simultaneously, which could lead to incorrect results (like overbooking or underbooking).

Output Explanation

  1. Thread Output Order:
  • Thread 0: Initially books several tickets (from 999 down to 990 seats left). This is expected, as it has an even ID and repeatedly enters the critical section to reduce available_seats.
  • Threads 4 and 6: Also book tickets, as they have even IDs. They operate similarly to Thread 0 but at different intervals.
  • Threads 1, 3, 5, 7: Attempt to cancel tickets. Initially, some of these threads try to cancel tickets when no tickets have been booked yet, resulting in messages like “Thread 5 tried to cancel a ticket, but none were booked.”
  • After some tickets are booked by even-numbered threads, odd-numbered threads successfully cancel tickets until the seat count returns to the limit of 1000 and there is nothing left to cancel.
  2. Thread Messages:
  • “Thread X tried to cancel a ticket, but none were booked.”: This message indicates that an odd-numbered thread attempted to cancel a ticket when all seats were already available (meaning there were no booked tickets to cancel). This check prevents errors from cancelling non-existent bookings.
  • “Thread X booked a ticket. Seats left: Y”: This message indicates successful ticket booking by even-numbered threads, reducing the total available seats.
  • “Thread X canceled a ticket. Seats left: Y”: Indicates that an odd-numbered thread successfully canceled a ticket, increasing the number of available seats back toward the total.
  3. Thread Synchronization and Competition:
  • Threads compete to enter the critical sections, which is why you see mixed output. The order of operations depends on which thread acquires the lock first.
  • Concurrency Control: OpenMP ensures that operations modifying the shared variable (available_seats) do not overlap, maintaining a correct data state throughout the program.
  4. Final Output and Result:
  • The final line reports the total number of available seats after all booking and cancellation attempts; in the run shown above it is Final number of available seats: 806.
  • This means a net 194 tickets remained booked after all the booking and cancellation attempts. Because thread scheduling is non-deterministic, this figure varies between runs.

Key Takeaways

  • OpenMP and Critical Sections: The use of #pragma omp critical ensures safe access to shared resources (available_seats), preventing errors that could arise from simultaneous read-modify-write operations by multiple threads. An explicit-lock alternative is sketched after these takeaways.
  • Concurrent Programming: The program illustrates a typical use case of concurrent programming, where multiple threads perform operations that must be carefully synchronized to maintain data integrity.
  • Application in Real-World Scenarios: Such a ticketing system must handle high-concurrency environments gracefully, which is why mutual exclusion (ensured by the critical section) is critical for accurate operations.
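For completeness, the same protection can be written with an explicit OpenMP lock instead of #pragma omp critical, which becomes handy when different resources (say, different flights) need independent locks. A simplified sketch of mine, with the booking logic reduced to a decrement:

#include <omp.h>
#include <stdio.h>

int main()
{
    int available_seats = 1000;
    omp_lock_t seat_lock;
    omp_init_lock(&seat_lock);

    #pragma omp parallel num_threads(8)
    {
        for (int i = 0; i < 100; ++i)
        {
            omp_set_lock(&seat_lock); // enter the protected region
            if (available_seats > 0)
                available_seats--; // "book" a seat
            omp_unset_lock(&seat_lock); // leave the protected region
        }
    }

    omp_destroy_lock(&seat_lock);
    printf("Final number of available seats: %d\n", available_seats);
    return 0;
}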

Conclusion:

OpenMP is highly advantageous when dealing with larger tasks that can be parallelized effectively. While for small tasks, the overhead might make OpenMP seem less efficient, its real power becomes apparent with more substantial and more complex computational workloads where it can significantly reduce execution time by utilizing multiple cores effectively.

This tutorial demonstrated parallel processing, synchronization, and mutual exclusion with OpenMP, and showed the performance benefits of parallel execution compared to sequential execution. By applying OpenMP directives, tasks can be distributed across multiple threads, reducing execution time for large datasets.

We explored how OpenMP improves efficiency in various scenarios, including computationally intensive tasks and practical applications like aviation ticketing systems.

The comparison between sequential and parallel execution highlights OpenMP’s potential for optimizing performance in real-world applications. Overall, OpenMP is a valuable tool for enhancing computational speed and efficiency.

That’s all folks!

See you soon!

👉 GitHub Repo

https://en.wikipedia.org/wiki/OpenMP

Credits & References

OpenMP Wiki by hpc-wiki.info

😎 Gilberto Oliveira Jr | 🖥️ Computer Engineer | 🐍 Python | 🧩 C | 💎 Rails | 🤖 AI & IoT | ✍️