Parallel Programming Primer I: Fundamental Concepts

Saurav Dhungana
CraftData Labs
Sep 9, 2017

Being a new data science company, we at CraftData Labs often have much smaller budgets to work with on our projects than big established companies. As a result, we have learned to make the most of the resources we have without incurring huge server costs. Even with the proliferation of distributed data processing, it is easy to forget that multi-core processors let us run many tasks in parallel within a single computer.

This article is the first in a series called Parallel Programming Primer, where we will share some of our hard-earned experience parallelizing our data processing tasks, in the hope that it will help others who encounter similar situations. In this first part we introduce the basic concepts of parallel computing and the parallel programming model. This should nicely lay the foundation for the rest of the series and provide the necessary background to understand it.

1. Parallel Hardware

It’s no secret that computer processors aren’t getting much faster anymore, and the age of multi-core processors has been upon us for a good part of the last decade. From the largest supercomputers to the tiniest embedded devices, these multi-core processors are everywhere. This has caused a paradigm shift from the traditional sequential way in which programmes were written. Most new programming languages these days have a fairly sophisticated concurrency model built in to take advantage of all the cores available, while more established languages are also introducing new features to keep up with this trend. We first look at the types of hardware available these days that allow parallel programming.

A Multicore world

An Intel Quad-core processor

As the trend towards more and more cores continues, different types of parallel hardware are available on the market, ranging from general-purpose to highly specialised. We explore some of the most important ones here.

Multi-core processors are the most common parallel processors we encounter, found in servers, desktops, and mobile devices. They usually contain 2 to 8 independent processing units, called cores, on the same chip. Processors with more than 8 cores are also available; these are usually referred to as many-core processors and are used in embedded and High Performance Computing (HPC) systems.

Symmetric multiprocessors (SMP) have multiple identical processors, usually grouped in powers of 2, that share the main memory and connect to it over a shared bus, appearing to the operating system as a single system. These processors are not on the same chip, but each one can itself be a multi-core processor.

General Purpose Graphics Processing Units (GPGPU) are special-purpose processors that act as co-processors to the main processing unit, and were originally intended for graphics processing. They don’t execute user programs by default, but can execute a program when this is explicitly requested by the host processor through APIs such as Nvidia’s CUDA.

Field Programmable Gate Arrays (FPGA) are also co-processors but they can rewire themselves for a given task. This leads to better performance as they can be optimised for a particular application. They are usually programmed in hardware description languages (HDL).

Distributed clusters are groups of computers connected via a network that do not have a common shared memory. Each computer is called a node and a node may need to cooperate with other nodes to solve a problem.

We are mostly focussed on multi-core processors on a single computer here, but there will also be some coverage of GPUs and distributed systems in later articles.
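
Since we will mostly be working on a single multi-core machine, a useful first step is to check how many cores are actually available. Here is a minimal sketch in Python (the language we will use for the illustrative snippets in this article):

```python
import os

# Logical cores visible to the OS; this may count hyper-threads,
# so it can be higher than the number of physical cores
print(os.cpu_count())
```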

Flynn’s Taxonomy

To understand parallel computing better, a classification scheme for computer architecture known as Flynn’s Taxonomy is commonly used. This system classifies a computer based on the number of instruction streams and data streams it can manage simultaneously. An instruction stream is the set of instructions that makes up a process, and a data stream is the set of data to be processed. We explore the classes most relevant to us here.

Single Instruction Single Data (SISD) systems are equivalent to the classical sequential von Neumann architecture: they execute a single instruction at a time and can fetch or store one data item at a time.

SISD Architecture

Single Instruction Multiple Data (SIMD) systems are present in most current parallel systems such as GPUs and desktop CPUs, where the same instruction is run on multiple data items at the same time. The different processing units work synchronously and are therefore much easier to implement. These are ideal for parallelising simple loops that operate on large arrays of data, which is accomplished by dividing the data among the processors and having each processor apply the same instructions to its subset.

SIMD Architecture
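
To make this concrete, here is a rough illustration in Python using NumPy, whose vectorized operations run in compiled loops that can use the CPU’s SIMD (vector) instructions where the hardware supports them:

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

# Element at a time: one instruction applied to one data item per step
squares_loop = [x * x for x in data]

# Vectorized: the same multiplication applied across the whole array
# in compiled code, which can use SIMD instructions where available
squares_vec = data * data
```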

Multiple Instruction Multiple Data (MIMD) systems support multiple simultaneous instruction streams operating on multiple data streams. These consist of a collection of fully independent processing units or cores that work asynchronously.

MIMD Architecture

MIMD is more flexible than SIMD and thus more generally applicable, but it is inherently more expensive. Hence, MIMD is the parallel architecture typically used in SMPs and distributed clusters, which serve more specialised use-cases. You may also find these further classified into SPMD and MPMD systems.

2. Parallel Programming Model

With the basics of parallel hardware explained, it’s time to learn how it is programmed. Although processors have had some form of parallelism built in for a long time, the traditional way we learn programming has wired our mental models to think sequentially.

Getting used to parallelising a problem and debugging it is far from trivial. But, for the reasons alluded to earlier, it has become imperative to learn parallel programming in order to get the maximum efficiency from our hardware. Having said this, it might not always be a good idea to do so, and the last section of this article goes into the limitations of the speed-ups we can expect from writing parallel versions of our code.

The following sections introduce the most important concepts one needs to get started with parallel programming.

Concurrency and Parallelism

Before going any further it is important to differentiate the two terms Concurrency and Parallelism. Most people tend to use them interchangeably, which often leads to misconceptions about what parallel programming is. They are both closely related concepts but aren’t quite the same.

Credit: Joe Armstrong (co-author of Erlang Programming Language)

Concurrency is a much broader and more general problem than parallelism. It is concerned with a program handling more than one task at the same time to increase its responsiveness. This doesn’t necessarily require parallel hardware and can happen by interleaving the execution steps of each task via time-sharing slices. An example is how the OS GUI allows multitasking by running many applications at the same time.

Parallelism, on the other hand, is concerned with running more than one task simultaneously on a multi-core processor to increase the speed of programs. Achieving concurrency may be one reason to do so, but not necessarily the only one. For example, if you want to convert a large number of colour photos into B/W, you can run the B/W filter on four photos simultaneously on a quad-core processor by batching them up in groups of four, as in the sketch below.
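
A minimal sketch of that photo example in Python, using the standard library’s concurrent.futures; the to_bw function here is a hypothetical stand-in for a real black-and-white filter:

```python
from concurrent.futures import ProcessPoolExecutor

def to_bw(path):
    # Hypothetical stand-in for a real B/W filter
    # (a real one might use an imaging library such as Pillow)
    return f"converted {path}"

photos = [f"photo_{i}.jpg" for i in range(16)]

if __name__ == "__main__":
    # Four worker processes: up to four photos are filtered at once
    # on a quad-core machine
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(to_bw, photos))
    print(results[:2])
```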

Types of Parallelism

As parallelism is the main topic of this series, we will go more in-depth to see its different types. Parallel processing requires splitting up the data to be handled, and/or the task itself. Based on how this is done, there are two main types of parallelism.

In Data Parallelism we partition the data used in solving the problem among the cores, and each core carries out more or less similar operations on its part of the data. This type of problem is pretty well understood, and is used widely in HPC. For example, if we want to square all the integers in an array, we can divide the data between different parallel units and perform the squaring operations in parallel in each, as in the sketch after the figure below.

Data Parallel Problem
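
A minimal sketch of the squaring example in Python, using a multiprocessing pool to divide the array among worker processes, each applying the same operation to its share:

```python
from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":
    numbers = list(range(1_000_000))
    # The pool splits `numbers` into chunks; each worker process
    # applies the same operation to its part of the data
    with Pool(processes=4) as pool:
        squared = pool.map(square, numbers, chunksize=10_000)
    print(squared[:5])  # [0, 1, 4, 9, 16]
```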

In Task Parallelism, instead of dividing up the data and doing the same work on different processors, we divide the operations to be applied among different parallel units. Hence, different tasks are applied to the same data at the same time; for example, we might want to compute several aggregations from the same data, such as the sum, average, minimum, and maximum. This is possible when there are no dependencies between the tasks, so they can run in parallel, as in the sketch after the figure below.

Task Parallel Problem
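
A minimal sketch of those aggregations in Python; each named function is an independent task applied to the same data:

```python
from concurrent.futures import ProcessPoolExecutor

def average(xs):
    return sum(xs) / len(xs)

data = list(range(1, 101))

# Different operations on the same data; the tasks are independent,
# so they can run in parallel
tasks = {"sum": sum, "min": min, "max": max, "average": average}

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        results = {name: f.result() for name, f in futures.items()}
    print(results)  # {'sum': 5050, 'min': 1, 'max': 100, 'average': 50.5}
```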

Each of these types require different program decomposition and aggregation schemes, and it is important that we have a clear idea of which type of problem we are trying to optimize.

Additionally, tasks are often classified based on how often their subtasks need to synchronize or communicate with each other. A task is said to exhibit fine-grained parallelism if its subtasks must communicate many times per second, coarse-grained parallelism if they communicate less often, and embarrassing parallelism if they rarely or never have to communicate. Embarrassingly parallel tasks are considered the easiest to parallelize.

Processes and Threads

Different models of program control flow

Parallel programs are executed by multi-core processors such that one or more control flows run on each processor. Depending on their coordination, these control flows are referred to as processes or threads. In the simplest terms, a process is an instance of an executing program. Threads are generalizations of the process concept and are sometimes also referred to as lightweight processes. A process can consist of several threads that share a common address space, whereas separate processes work in different address spaces.

Which of these two constructs is more suitable for a given parallel computing problem depends on the physical memory organization of the execution environment. Processes are usually more suitable for distributed memory systems whereas threads are typically used for shared memory systems.

Another factor that dictates which model we choose is the restrictions in the design of the programming language we are using. Statically-typed compiled languages like C++, Java and C# primarily use OS threads, while dynamic interpreted languages like Python and Ruby, due to restrictions such as the infamous Global Interpreter Lock (GIL), tend to favour processes to utilise all the available cores.
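
The following sketch illustrates this in CPython: for CPU-bound work, threads are serialized by the GIL, while processes can use separate cores, so on a multi-core machine the process version typically finishes faster:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # CPU-bound work: pure-Python arithmetic holds the GIL
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [2_000_000] * 4
    for name, Executor in [("threads", ThreadPoolExecutor),
                           ("processes", ProcessPoolExecutor)]:
        start = time.perf_counter()
        with Executor(max_workers=4) as pool:
            list(pool.map(busy, work))
        print(name, round(time.perf_counter() - start, 2), "seconds")
```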

Amdahl’s Law

Reading about the promise of parallel programming, it’s easy to start expecting an N-times speedup of your code on an N-core processor. This however isn’t the case, even for embarrassingly parallel problems where there are no communication overheads between the tasks.

The reason can be seen from what is called Amdahl’s Law. It states that, for a given operation, if F is the fraction of a calculation that is sequential, and (1 - F) is the fraction that can be parallelized, then the maximum speed-up that can be achieved by using a P-core processor is given by the formula:

Speedup = 1 / (F + (1 - F) / P)

This means if we consider an operation 90% of which can be parallelized, the speedups that we get by running this on a 4-core, 8-core, 20-core and 1000-core processor are 3.1, 4.7, 6.9 and 9.91 respectively.
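
A quick sketch in Python to reproduce those numbers (here the parallel fraction 1 - F is 0.9):

```python
def amdahl_speedup(parallel_fraction, cores):
    f = 1 - parallel_fraction  # sequential fraction F
    return 1 / (f + parallel_fraction / cores)

for cores in (4, 8, 20, 1000):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
# 4 -> 3.08, 8 -> 4.71, 20 -> 6.9, 1000 -> 9.91
```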

We can see that we get diminishing returns as the number of cores increases (1000 cores only gives a speed-up of 9.91). Simply throwing in more cores isn’t always the answer. In fact, even if virtually all parts of a task can be parallelized, the small sequential remainder caps the possible speedup, regardless of the number of cores available to us. The actual speed-up is going to be even less, since Amdahl’s Law is the theoretical maximum.

Thus, anyone who is writing parallel code must have this knowledge to avoid having unrealistic expectations of what parallelizing an operation can achieve. In many cases the added complexity of writing parallel code may render the speedup we gain useless, and can even lead to incorrect behaviour. A general rule of thumb is to first create a “correct” version of your code, and only then convert it into its parallel equivalent if it is worth doing so.

This brings us to the conclusion of this first part. In Part 2 we will see how parallel programming is achieved using OS-level processes and threads in Linux. In later parts we will also be looking at how higher abstractions over these processes and threads are used in languages like Java, Python, Go and functional languages like Erlang and Clojure. Stay tuned for that and much more.

Thank you for reading!
