Parallel Programming Primer II: Multiprocessing and Multithreading

Saurav Dhungana
Published in CraftData Labs
Oct 10, 2017

In Part I of this series we introduced parallel hardware and software at a basic level. It discussed how parallel programming is concerned with the simultaneous execution of multiple computations, and how it is enabled by parallel hardware. This post builds on those concepts to cover the two most prevalent multi-core programming paradigms: multiprocessing and multithreading. Along the way, we will introduce processes and threads in the context of an Operating System (OS) via low-level system calls, and also look at some of their shortcomings.

It is important to point out that this post is concerned with utilizing all the cores in a computer, not with other use cases of concurrent programming. As such, the term concurrency should be understood here in the sense of true parallelism on multi-core processors.

1. The Linux Programming Model

In the broadest terms, an OS consists of the core software that manages the computer’s resources, together with the basic standard software tools, such as command-line interpreters, graphical user interfaces and file utilities, that are distributed with it. The core piece of software containing the critical OS routines that control a computer’s shared physical resources, such as memory, processors and I/O devices, is called the kernel.

GNU/Linux is an open-source, Unix-like OS that is hugely popular today. The name Linux actually refers to just the kernel, while the GNU system provides all the other essential building blocks of a fully functional Linux distribution; hence the name GNU/Linux. It runs everywhere from the largest servers to the tiniest embedded devices.

Given these facts and our own experience, we chose the Linux programming interface to explore these ideas. Having said that, the details should be almost the same for macOS (which is also a Unix variant), and Windows has close equivalents.

Kernel space and user space

There are two main modes in which a modern CPU operates: user mode and kernel mode. As the names suggest, ordinary program code runs in user mode, and the core OS components run in kernel mode. The virtual memory allocated in these modes is said to belong to user space or kernel space respectively.

In user mode the CPU can only access memory in user space, while in kernel mode it can access both memory spaces. Restricting user mode programs in this way prevents them from altering, and possibly damaging, critical OS data and routines, thus avoiding catastrophic failures. The processor switches between the two modes via hardware instructions depending on what type of code is running. The diagram below illustrates the two modes.

The GNU/Linux Programming Model

The Standard Library

Since only kernel space code has direct access to the underlying hardware, the kernel makes a range of services accessible to user space programs via controlled entry points called system calls. A system call originates in user space and is carried out in kernel space. In Linux, system calls are mainly made through a C API provided by the libc (or glibc) library, also known as the standard library. Note that not all standard library functions perform a system call.
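As a small illustration, consider the write() function declared in unistd.h. It is a thin wrapper that traps straight into the kernel, whereas a library function like printf() buffers its output in user space and only eventually invokes write() on our behalf. A minimal sketch:

#include <string.h>
#include <unistd.h>

int main(void) {
    const char *msg = "hello from a system call\n";
    /* write() crosses from user space into kernel space; the kernel
       performs the actual I/O on file descriptor 1 (standard output) */
    write(STDOUT_FILENO, msg, strlen(msg));
    return 0;
}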

2. Multiprocessing Basics

A process is essentially a running instance of a program. When a program is executed, the kernel loads its code into virtual memory, allocates space for program variables, and sets up data structures to record various information about it. So, from the kernel’s point of view, processes are the fundamental entities to which it allocates system resources. Linux uses what is called pre-emptive multitasking: the rules governing which process gets which CPU core and for how long are determined by the kernel’s process scheduler, rather than by the processes themselves.
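To make the notion of a process concrete, the short sketch below asks the kernel to create a second process using the fork() call (which we will meet again shortly); the parent and child then each print their own process IDs.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();            /* ask the kernel to duplicate this process */

    if (pid == 0) {
        /* child: a separate process with its own process ID */
        printf("child:  pid=%d, parent=%d\n", getpid(), getppid());
    } else {
        /* parent: fork() returned the child's process ID */
        printf("parent: pid=%d, child=%d\n", getpid(), pid);
        wait(NULL);                /* wait for the child to finish */
    }
    return 0;
}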

When the OS allocates its resources, each process is isolated from the others and from the kernel, so one process can’t read or modify the memory of another process or of the kernel. Processes may even be on different CPU cores, running truly simultaneously. Some processes, however, need to cooperate with each other, which requires some means of communicating and synchronizing their actions. For this, Linux provides a rich set of mechanisms for interprocess communication (IPC) via signals, sockets, pipes, semaphores etc. Let’s have a look at how one of them works.

Linux IPC via unnamed pipes

We will now briefly go through a simple use-case of IPC in Linux using unnamed pipes. This is the oldest IPC mechanism available in the Unix family and is widely used even today.

Pipes are a familiar concept to anyone who has used shell commands before. They are really useful when you want to use the output of one command as the input for another. A commonly used example is given below.

$ ls | wc -l

This command counts the number of files in a directory. Placing a “|” (pipe) after ls causes the list of files from its output to be used as the input of the wc command.

In order to execute the above command, the shell creates two processes, executing ls and wc, respectively. The mechanism of how this works can be seen in the figure below.

Using pipes for IPC

We can mimic this command using the standard library functions pipe() and fork() from glibc. Both functions are declared in the unistd.h header, which we include at the top of the program.

The parent process (program) creates one child process for each of the commands, ls and wc. The fork() system call is used to create the new processes. Because file descriptors are inherited across fork(), the parent builds the pipe using the pipe() function before creating the child processes.

Each child inherits the file descriptor array pipeFDs, holding the read and write ends of the pipe, and closes the ends it no longer needs once it has completed its action. The parent waits until all children have completed their actions.
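One possible sketch of such a program is shown below. The first child redirects its standard output to the write end of the pipe before executing ls, and the second child redirects its standard input to the read end before executing wc -l.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int pipeFDs[2];              /* [0] is the read end, [1] is the write end */

    if (pipe(pipeFDs) == -1) {   /* create the pipe before forking */
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    if (fork() == 0) {           /* first child: runs ls, writes to the pipe */
        close(pipeFDs[0]);                 /* read end is unused */
        dup2(pipeFDs[1], STDOUT_FILENO);   /* stdout now points to the pipe */
        close(pipeFDs[1]);
        execlp("ls", "ls", (char *)NULL);
        perror("execlp ls");
        exit(EXIT_FAILURE);
    }

    if (fork() == 0) {           /* second child: runs wc -l, reads the pipe */
        close(pipeFDs[1]);                 /* write end is unused */
        dup2(pipeFDs[0], STDIN_FILENO);    /* stdin now points to the pipe */
        close(pipeFDs[0]);
        execlp("wc", "wc", "-l", (char *)NULL);
        perror("execlp wc");
        exit(EXIT_FAILURE);
    }

    /* parent: close both ends and wait for both children to finish */
    close(pipeFDs[0]);
    close(pipeFDs[1]);
    wait(NULL);
    wait(NULL);
    return 0;
}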

From a data processing point of view this is useful when we want to create data pipelines. In a data pipeline, tasks are usually arranged so that the output of one stage is fed as the input of the next. There will also be tasks that can run in parallel, for which we can create multiple child processes without any need for IPC between them.

3. Multithreading Basics

A thread is a sequence of executable instructions within a process. We can think of threads as a set of lightweight processes that share the same virtual memory. Each thread executes the same program code and shares the global variables and heap, but each has its own local variables and function call stack. The threads in a process can execute concurrently, and on a multi-core processor they can execute in parallel.

The primary advantage of using threads is that they make it easy to share data (via global variables) between cooperating threads. Additionally, the cost of switching between threads in the same process is almost negligible, whereas switching between processes is significantly more expensive.

A multithreaded application can also take advantage of the parallelism of multi-core processors more transparently. This, however, comes at the price of possible issues. The programmer, not the OS, is responsible for coordinating the threads in a process and making sure that no two threads inappropriately modify the shared data. Too little coordination causes unpredictable program behavior due to race conditions, and too much can cause deadlocks that stall the program for long periods of time.

Coding with pthreads

From an implementation point of view, threads are created in Linux using the POSIX-compliant pthreads library. There is no separate concept of a thread in the Linux kernel; it is strictly a user space construct. Internally, the clone() function is used to create both normal and lightweight processes. To create a normal process, fork() is used, which in turn calls clone() with the appropriate flags to perform the actual creation. To create a thread, or lightweight process, the pthreads library calls clone() with flags that tell the kernel to share resources, such as the virtual memory, with the parent. So, the main difference comes from the different flags passed to the clone() function.
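As a rough illustration of this idea, the sketch below calls clone() directly with the CLONE_VM flag so that parent and child share their virtual memory, much as two threads do. This is purely illustrative; in practice you should always use the pthreads API.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared = 0;

/* runs in the child created by clone() */
static int child_fn(void *arg) {
    shared = 42;   /* visible to the parent because CLONE_VM shares memory */
    return 0;
}

int main(void) {
    const size_t STACK_SIZE = 1024 * 1024;
    char *stack = malloc(STACK_SIZE);

    /* the child's stack grows downward, so pass the top of the block;
       CLONE_VM shares the virtual memory, SIGCHLD lets the parent wait */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_VM | SIGCHLD, NULL);

    waitpid(pid, NULL, 0);
    printf("shared = %d\n", shared);   /* prints 42 */
    free(stack);
    return 0;
}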

Let’s look at a simple example where we use threads to calculate the value of π. Though this particular method isn’t the most efficient way to calculate π, it lends itself pretty well to showing how parallelization via pthreads works.

The basic idea for this stems from the trigonometric identity tan(π/4) = 1, from which we can obtain π = 4 arctan(1).

The Maclaurin series for arctan(x) is given by:

arctan(x) = x − x³/3 + x⁵/5 − x⁷/7 + … (for |x| ≤ 1)

Setting x = 1, this can then be used to compute an approximate value of π from the first n terms of the series:

π ≈ 4 (1 − 1/3 + 1/5 − 1/7 + … + (−1)ⁿ⁻¹/(2n − 1))

The terms in the summation are independent of each other and can be divided into evenly sized chunks, one for each thread. Each thread then calculates its partial sum and adds it to the shared global variable sum. The code below provides one way of doing that using pthreads. Ignore the lines with mutexes for now; they will be explained in the next section. Also, remember to compile this code with the -lpthread flag, which tells the compiler to link in the pthreads library.
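What follows is one possible sketch of such a program; for simplicity it assumes n is evenly divisible by the number of threads.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

long thread_count;                 /* number of threads, from argv[1] */
long long n;                       /* number of series terms, from argv[2] */
double sum = 0.0;                  /* shared global result */
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

/* each thread computes a partial sum over its own chunk of the series */
void *PI_calc(void *rank) {
    long my_rank = (long)rank;
    long long my_n = n / thread_count;   /* assumes n % thread_count == 0 */
    long long my_first = my_rank * my_n;
    long long my_last = my_first + my_n;
    double factor = (my_first % 2 == 0) ? 1.0 : -1.0;
    double my_sum = 0.0;

    for (long long i = my_first; i < my_last; i++, factor = -factor)
        my_sum += factor / (2 * i + 1);

    pthread_mutex_lock(&mutex);    /* critical section: update shared sum */
    sum += my_sum;
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main(int argc, char *argv[]) {
    thread_count = strtol(argv[1], NULL, 10);
    n = strtoll(argv[2], NULL, 10);

    pthread_t *handles = malloc(thread_count * sizeof(pthread_t));

    for (long t = 0; t < thread_count; t++)
        pthread_create(&handles[t], NULL, PI_calc, (void *)t);

    for (long t = 0; t < thread_count; t++)
        pthread_join(handles[t], NULL);

    printf("pi is approximately %.15f\n", 4.0 * sum);
    free(handles);
    return 0;
}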

The pthread.h header provides the functions and types we need to work with pthreads. The program takes the number of threads and the number of terms n for the series as arguments from the user. The sum variable is declared globally, and the program creates one pthread_t object for each thread.

The pthread_create() function then starts each thread, taking as arguments a pointer to its pthread_t object and the PI_calc() function that the thread needs to run. Where these threads run is determined by the OS, depending upon the availability of cores in the machine while the program is running. Each thread updates the global sum variable independently to give us the total sum. The pthread_join() function is used by the main thread to wait for the other threads to complete.

When we ran this program on a 4-core MacBook Pro laptop using n = 10⁸ and 2, 4, 8 and 16 threads, we got a 1.9x, 3.6x, 5.1x and 4.9x speed-up respectively over the serial version. The decrease in speed-up beyond 8 threads is likely because the overhead of creating and handling the extra threads outweighs the advantage of running the partial sums in parallel.

Thread Safety and Locks

We saw the power that threads give us by allowing us to run many computations at the same time. But since the programmer is responsible for how shared memory is accessed by the various threads, problems can arise when more than one thread tries to read or write a shared global variable at the same time. The unpredictable ordering of these accesses may give us an incorrect result. This is called a race condition. The most common place for a race condition is a critical section: a block of code that accesses a shared resource and must therefore be executed by only one thread at a time.
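The classic illustration is two threads incrementing a shared counter without any synchronization, as in the sketch below. The increment is a read-modify-write sequence, so updates from one thread can be lost when they interleave with the other’s.

#include <pthread.h>
#include <stdio.h>

long counter = 0;                  /* shared and unprotected */

void *increment(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;                 /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* expected 2000000, but usually prints less: updates were lost */
    printf("counter = %ld\n", counter);
    return 0;
}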

In order to produce the correct result, we need to make sure that once one thread starts executing such a statement, it finishes before any other thread begins executing it. A function or a variable is said to be thread-safe only if it can be safely used by multiple threads at the same time.

There are various methods of making a function thread-safe. One way is to associate a mutual exclusion lock, or mutex, with the function or global variable. What it does is restrict, or lock, access to the variable to a single thread at a time. A mutex can thus be used to guarantee that one thread “excludes” all other threads while it is accessing the global variable. The threads are then said to be synchronized, effectively making that part of the code serial.

A mutex is precisely what is being used in our code example above. We determined the critical section of the code to be in the PI_calc() function, specifically the line that adds the local sum of each thread, my_sum, to the shared global variable sum. We acquire and release the mutex only around the execution of that particular line.

Now that we have looked at how OS-level threads and processes are programmed, this should be a good starting point for understanding how they work. Manually managing them like this in C, however, can quickly get cumbersome in larger projects. Even with careful application of locks, we can still run into problems like deadlocks. There are C/C++ APIs such as OpenMP and MPI (e.g. Open MPI) that provide higher-level abstractions for shared memory and message passing respectively. Anyone who needs to use low-level threads and processes in non-trivial use-cases is advised to look into these two.
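As a taste of how much these abstractions simplify things, here is a sketch of the same π computation using OpenMP (compile with -fopenmp); the reduction clause replaces the explicit mutex and thread management from earlier.

#include <omp.h>
#include <stdio.h>

int main(void) {
    long long n = 100000000;       /* number of series terms */
    double sum = 0.0;

    /* reduction(+:sum) gives each thread a private copy of sum and
       combines them at the end, so no explicit mutex is needed */
    #pragma omp parallel for reduction(+:sum)
    for (long long k = 0; k < n; k++)
        sum += (k % 2 == 0 ? 1.0 : -1.0) / (2 * k + 1);

    printf("pi is approximately %.15f\n", 4.0 * sum);
    return 0;
}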

In the next part we will be looking at the concurrency model in languages like Java, Python and Go. These provide high level data types and constructs that make working with threads and processes easier, but do so in fundamentally different ways.

Thank you for reading.
