The Rise of Parallel Processing: Introduction to GPUs (Part 1)

Yogesh Chhabra
5 min read · Mar 1, 2019


Image source: https://coinnounce.com/list-of-the-best-cryptocurrency-to-mine-with-gpu/

The need for Parallel Programming.

“Why should we go for Parallel Programming?” is a valid question, and I would like to answer it with the following scenario:

You are to organize a football tournament of 1024 teams. Consider the following ways to set the fixtures:

1.) A match between team 0 and team 1 (yes, counting always starts with 0 for a programmer); the winner plays against the next team, i.e. team 2, and that winner plays against the next team, and so on. How many matches does it take to find the best football team?

2.) Divide the 1024 teams into 512 pairs of 2 teams each. In the first round, the two teams in each pair compete against each other (in parallel), and the winners advance to the next round. Form 256 pairs from the winners and keep playing in parallel until you get a single winner. How many matches does it take this way?

Image source: https://www.iconfinder.com/icons/2836175/fixture_fixtures_group_match_stage_table_icon

How would you like to create the fixture?

The first method takes 1023 matches. The second one takes 512 matches in the first round, 256 in the next, then 128 and so on. It also comes out to be 1023.

We all know that method 2 is what is used in real-life scenarios, but if the number of matches is the same, then why do people prefer it? Here we can see the advantage of parallelism: all 512 matches in the first round are played simultaneously, thus effectively consuming the time of a single match. So in total, the tournament takes time equivalent to that of log₂(1024) = 10 matches.

Using parallelism, the time decreased from O(N) to O(log N). Now, that’s cool, right? The next question is: can you always do it? Think about it for a while. The answer is no. The key factor here is the number of football grounds you have, which limits the number of matches that can be played in parallel.

Let’s assume that we have only 32 football grounds. How much time will the tournament take now? Let’s calculate this.

With 32 grounds I can have 32 matches running in parallel, i.e. 64 teams playing at once, giving 32 winners. So round 1, which has 512 matches and previously took time equivalent to 1 match, now takes time equivalent to 512/32 = 16 matches. Similarly, round 2 takes time equivalent to 8 matches, and the next takes 4. What about the round that has 16 teams? It takes time equivalent to 1 match (some grounds will be vacant). The total time = 16 (512 matches) + 8 (256 matches) + 4 (128 matches) + 2 (64 matches) + 1 (32 matches) + 1 (16 matches) + 1 (8 matches) + 1 (4 matches) + 1 (2 matches) + 1 (1 match) = time taken by 36 matches.

The general solution for N teams (a power of two) with 32 grounds works out to N/32 + 4: the rounds with more than 32 matches take (N − 64)/32 = N/32 − 2 match-times in total, and the remaining six rounds take one match-time each. For N = 1024, that gives 36.

Hence, we cannot always reduce the time logarithmically, but we can still decrease it by some factor. What we needed here was an enormous number of football grounds.
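If you want to double-check these numbers, here is a minimal sketch in plain C++ (the function name timeSlots is my own, not from any library) that counts the match-times for a knockout of a given number of teams and grounds:

```
#include <cstdio>

// Rough model of the tournament. `teams` is assumed to be a power of two.
// A round of M matches on G grounds costs ceil(M / G) match-times,
// since only G matches can be played at the same time.
int timeSlots(int teams, int grounds) {
    int slots = 0;
    for (int matches = teams / 2; matches >= 1; matches /= 2)
        slots += (matches + grounds - 1) / grounds;   // ceil(matches / grounds)
    return slots;
}

int main() {
    printf("%d\n", timeSlots(1024, 1024)); // 10   -> unlimited grounds: log2(1024) rounds
    printf("%d\n", timeSlots(1024, 32));   // 36   -> only 32 grounds
    printf("%d\n", timeSlots(1024, 1));    // 1023 -> one ground: fully sequential
    return 0;
}
```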

That’s the idea behind a GPU, an enormous number of processing cores working in parallel.

Parallel vs Sequential processing.

Image source: http://www.itrelease.com/wp-content/uploads/2017/11/Sequential-processing-vs-parallel-processing.jpg

But wait a minute, this example was very simplified: what if I were able to reduce the time of a football match itself?

Let things turn a bit technical now. If I have to execute a program, I can work on making its execution faster, or I can divide it into smaller programs that can run in parallel, or I can use a combination of both.

In the development of processors, the aim for a long time was to make execution on a single core faster and faster, until people realized that they could not go beyond a certain speed. Thus “parallelism” came to the rescue: people started adding more cores to a processor and began dividing tasks across cores that run in parallel.

Still, CPU designers aim to make the execution of a single thread faster, so they use costly but high-performance microcircuits (whatever area they take). The aim of a GPU designer is to handle more and more threads in parallel, so the area and cost of each core are important factors here. That’s why early GPUs didn’t have multiplication operations.

History of GPUs

As the name suggests, the Graphics Processing Unit was initially designed to calculate the colour values of the pixels on the screen; there is a massive number of pixels on a screen, and hence a huge scope for parallel calculation. Soon people realized the power of parallel processing and started using it for general-purpose calculations, designing their algorithms around parallelism. Since the initial aim of the GPU was to calculate the colour values of pixels, programmers had to send their input data in as textures, receive the output as RGB pixel values, and extract their results from those.

As usage increased, GPU producers started making APIs for general-purpose calculations. CUDA and OpenCL are examples of such APIs.
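To give a feel for what such an API looks like in practice, here is a minimal CUDA C++ sketch (my own illustrative example, not code from this article) that adds two arrays, with each GPU thread handling exactly one element:

```
#include <cstdio>
#include <cuda_runtime.h>

// The work is divided into n tiny subprograms: each GPU thread adds one element.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                  // one million elements
    const size_t bytes = n * sizeof(float);

    // Host (CPU) memory
    float *hA = new float[n], *hB = new float[n], *hC = new float[n];
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device (GPU) memory
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch one thread per element, grouped into blocks of 256 threads
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);           // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}
```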

Now we are in a position to understand what a GPU is: a processor with a large number of cores, used mainly for large numbers of floating-point operations that can be computed in parallel. The term giga floating-point operations per second (GFLOPS) is used for the number of floating-point operations a GPU can do in a second, counted in billions.
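As a back-of-the-envelope illustration of the unit (the core count and clock below are made-up numbers, not the specification of any real GPU), the theoretical peak is usually estimated as cores × clock × operations per cycle:

```
#include <cstdio>

int main() {
    // Hypothetical GPU: these numbers are for illustration only.
    double cores         = 2048;  // parallel processing cores
    double clockGHz      = 1.5;   // clock frequency in GHz (billions of cycles per second)
    double flopsPerCycle = 2;     // a fused multiply-add counts as 2 floating-point ops

    // cores * GHz already gives billions of operations per second, i.e. GFLOPS
    double peakGflops = cores * clockGHz * flopsPerCycle;
    printf("Theoretical peak: %.0f GFLOPS\n", peakGflops);  // 6144 for these numbers
    return 0;
}
```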

Hiding the latencies

As I previously mentioned, a CPU core is generally much more powerful (faster) than a GPU core (unless you are using your grandpa’s old computer), so the next question that comes to mind is: is it really more efficient to divide a program into parallel subprograms that run on slow cores, rather than running the program on a single faster core?

The answer is ‘it depends’. If you can divide the program into a large number of parallel threads (many more than the number of threads that can run in parallel on the GPU at a given point in time), it’s better to use the GPU.

Let’s try to understand the reason for this behaviour. The scenario is that the number of threads is much larger than the number of threads that can be executed in parallel at a given point in time. In such cases, the GPU can hide the cost of long-latency operations, where the cost of an operation means the time (or rather the number of cycles) it takes. Memory accesses are the classic long-latency operations, and they are the performance bottleneck in any processor.

But GPUs are really smart: whenever a thread starts a long-latency operation, the GPU sets that thread aside and starts executing some other thread, picking the first one up again once the long-latency operation has completed. This way, if there are a large number of threads, latency hiding makes the execution really fast. Please note that the above statements are not completely precise, but they are correct enough to convey the basic idea. We’ll see more on latency hiding after we discuss the organization of threads in a GPU in part 2.
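Until then, here is a rough CUDA sketch of the kind of kernel that benefits from this (again my own illustrative example, not the exact mechanism, which we cover in part 2): the arithmetic per thread is trivial, so most of a thread’s time is spent waiting on global-memory loads, and launching far more threads than the GPU has cores gives the hardware something else to run during those waits.

```
#include <cuda_runtime.h>

// Memory-bound kernel: one load of x[i] and y[i], one multiply-add, one store.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // While this thread's group waits for the loads from global memory,
        // the core's scheduler switches to another resident group of threads,
        // so the waiting times of many threads overlap instead of adding up.
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 24;        // ~16.7 million elements: far more threads
    float *x, *y;                 // than any GPU can execute at the same time
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));   // dummy data, just to have defined values
    cudaMemset(y, 0, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```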
