Parallel Processing and Programming

Kristoffer Hebert
Sep 15, 2019 · 12 min read

A summary of what I have learned over the years.

Introduction

My background in parallel processing and programming comes from hobby programming projects, online Java programming classes, and two Udemy courses on the subject of parallel computing. I have designed a Node.js program that takes YouTube videos and burns subtitles onto the video. My program splits the videos to be processed into tasks, which are divided programmatically depending on how many cores the computer has. The tasks are completed in parallel and provide significant throughput compared to processing the videos one by one. I also applied my knowledge of Java from CS101 to two Java programs for my Udemy courses. The first program calculates the sum of an array of integers both sequentially and in parallel, splitting the array into subproblems based on the number of cores on the computer. The second program sorts an array of integers both sequentially and in parallel. After each program runs, it displays the time it took to compute in milliseconds.

Define terminology commonly used in parallel computing

Task
“A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor” (Krastev 14). An example of a task would be dividing an image into sections based on the number of cores and then processing each section in parallel.

Parallel Task
“A task that can be executed by multiple processors safely (yields correct results)” (Krastev 14). For example, computing the sum of an integer array by dividing it into subproblems, or tasks. The order in which each subproblem gets computed has no effect on the final result, since order does not matter in addition.

Pipelining
“Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing” (Krastev 14).

Speed up
“The speedup is defined as the ratio of the serial run time of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on [p] processors” (Krastev 7).

The formula for speedup
Speedup = 1 / ((1 − P) + P / n), where P is the fraction of the program that can run in parallel and n is the number of processors.
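To make the formula concrete, here is a small helper (my own illustration, not part of any exhibit) that evaluates it in Java:

public class Speedup {

    // Amdahl's Law: speedup = 1 / ((1 - P) + P / n)
    // P = fraction of the program that can run in parallel, n = number of processors
    static double amdahl(double parallelFraction, int processors) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / processors);
    }

    public static void main(String[] args) {
        System.out.println(amdahl(0.95, 4));    // ~3.48x, even with four cores
        System.out.println(amdahl(0.95, 1000)); // ~19.6x, approaching the 20x ceiling
    }
}

Even a program that is 95% parallel tops out at roughly a 20x speedup no matter how many cores are added, which is the limit described below.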

A general rule of thumb is that the number of tasks should match the number of processor cores. For example, my home computer has an Intel i5 CPU with four cores, so the sweet spot for tasks would be four. In Exhibit D, my parallel sum program had an enormous increase in throughput. I spent some time debugging it and printing values as it did the computations, and I could not believe how fast it was compared to the sequential version.

Amdahl’s Law
“The speedup of a program using multiple processors in parallel computing is limited by the time needed for the serial fraction of the problem. Amdahl’s Law implies that parallel computing is only useful when the number of processors is small, or when the problem is perfectly parallel, i.e., embarrassingly parallel” (Zhang 12–13). My programs for burning subtitles into videos and calculating the sum of an integer array are both embarrassingly parallel problems. The order in which the tasks completed had no bearing on the results or correctness of the program.

Parallel Overhead
The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as (Krastev 15):

  1. Task start-up time
  2. Synchronizations
  3. Data communications
  4. Software overhead imposed by parallel compilers, libraries, tools, operating systems, etc.
  5. Task termination time

Parallel overhead is the main reason why the number of tasks should match the number of cores. When more tasks are added to the same number of cores, throughput drops off due to the five items described above. I ran into problems with parallel overhead during my parallel quicksort assignment, Exhibit D. There is a significant amount of overhead when starting up new threads, and the small amount of computing work involved in sorting an array of integers does not justify that overhead. The parallel quicksort was slower than the sequential quicksort: it took 1 millisecond to sort an integer array sequentially and 12 milliseconds to sort the same array in parallel, making it 12 times slower. I later learned about Java’s fork/join framework, but not until after finishing the assignment. I’m sure there are some code optimizations I could make; maybe the creation of new arrays for partition numbers could be improved. However, I don’t believe any parallel quicksort that uses raw threads in Java would be faster than the sequential one on this input size, due to parallel overhead.
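For illustration, here is a rough sketch of how a fork/join version of quicksort might look. This is not the Exhibit D code, and the cutoff value is an arbitrary guess; the key idea is that small partitions are sorted on the current thread, so the cost of spawning parallel work is only paid for large chunks:

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ForkJoinQuicksort extends RecursiveAction {

    private static final int CUTOFF = 10_000; // below this size, sort sequentially
    private final int[] data;
    private final int lo, hi; // sorts the range [lo, hi)

    ForkJoinQuicksort(int[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= CUTOFF) {
            Arrays.sort(data, lo, hi); // small partition: avoid parallel overhead
            return;
        }
        int pivot = data[lo + (hi - lo) / 2];
        int left = lo, right = hi - 1;
        while (left <= right) { // standard in-place partition around the pivot
            while (data[left] < pivot) left++;
            while (data[right] > pivot) right--;
            if (left <= right) {
                int tmp = data[left];
                data[left] = data[right];
                data[right] = tmp;
                left++;
                right--;
            }
        }
        // Sort the two partitions as parallel subtasks on the pool's worker threads
        invokeAll(new ForkJoinQuicksort(data, lo, right + 1),
                  new ForkJoinQuicksort(data, left, hi));
    }

    public static void main(String[] args) {
        int[] numbers = new java.util.Random(42).ints(10_000_000).toArray();
        long begin = System.currentTimeMillis();
        ForkJoinPool.commonPool().invoke(new ForkJoinQuicksort(numbers, 0, numbers.length));
        System.out.println("Sorted in " + (System.currentTimeMillis() - begin) + " milliseconds");
    }
}

The pool reuses a fixed set of worker threads, so the thread start-up cost described above is paid once rather than on every recursive call.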

Scalability
Refers to a parallel system’s (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include (Krastev 15):

  1. Hardware — particularly memory-CPU bandwidths and network communications
  2. Application algorithm
  3. Parallel overhead related
  4. Characteristics of your specific application and coding

One hardware example would be increasing the throughput of a crypto-miner, which uses a video card’s GPU to solve math problems that complete transactions on the blockchain, by adding additional video cards. Another hardware example: if additional CPU sockets are available on a motherboard, adding more CPUs would increase the number of requests that could be handled in parallel.

Synchronization
“The coordination of parallel tasks in real-time, very often associated with communications. Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task(s) reaches the same or logically equivalent point. Synchronization usually involves waiting by at least one task, and can, therefore, cause a parallel application’s wall clock execution time to increase” (Krastev 15). When writing code in Java, Thread.join() acts as a point for the program to synchronize with the rest of the code. I used the Promise.all method in Node.js to wait until all parallel processes had completed burning subtitles into videos before continuing with the rest of the program.
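Here is a minimal, illustrative Java sketch (not taken from any exhibit) of join() acting as that synchronization point:

public class JoinExample {
    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            final int id = i;
            workers[i] = new Thread(() ->
                    System.out.println("Worker " + id + ": task finished"));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join(); // synchronization point: block here until this worker is done
        }
        System.out.println("Main: all workers finished, continuing with the rest of the program");
    }
}

Promise.all in Node.js plays the same role: the surrounding code continues only after every task in the group has completed.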

Supercomputing /High-Performance Computing (HPC)
Using the world’s fastest and largest computers to solve large problems (Krastev 16). One of the most famous supercomputers is IBM Watson, because it appeared on the game show Jeopardy! According to nytimes.com, “Watson, specifically, is a ‘question answering machine’ of a type that artificial intelligence researchers have struggled with for decades — a computer akin to the one on ‘Star Trek’ that can understand questions posed in natural language and answer them” (Markoff). Watson went on to defeat legendary Jeopardy! players such as Ken Jennings and Brad Rutter. Another IBM example of a supercomputer would be the chess computer Deep Blue. According to theconversation.com, “In defeating Kasparov on May 11, 1997, Deep Blue made history as the first computer to beat a world champion in a six-game match under standard time controls” (Morgan/Reuters).

Describe different parallel architectures, interconnect networks, programming models, and algorithms for common operations

What are some different parallel architectures?

Why would you use parallel architecture?
The cost of CPU performance goes up exponentially as processor speed goes up. The alternative is to use multiple CPUs in parallel to increase compute throughput per dollar spent. Motherboards typically have only one socket for one CPU; however, some motherboards have multiple sockets for multiple CPUs, such as Intel Xeon processors.

Examples of parallel architectures?
Pipeline architecture breaks a task into phases that can be handled in parallel by different cores, like an assembly line, where tasks are processed and completed, then moved down the line to the next phase. If I were to apply pipeline architecture to my Node.js video processor, I would add the ability to generate thumbnails, transcribe captions, and upload videos to YouTube. I would break the program into phases such as video processing, thumbnail generation, voice transcription, and uploading. I have four physical cores on my computer, and each core could be dedicated to one phase.

A real-life example of pipeline architecture would be going to Subway and ordering a sandwich. Imagine a line of customers waiting for sandwiches, with two or three sandwich makers and one register. The sandwiches are made in multiple phases: in the first phase, meat is added to the bread; in the second, veggies are added; and in the third and final phase, customers pay for their food at the register. Multiple customers get their sandwiches made in parallel as each phase is completed and the sandwich is passed down the line. Pipeline architecture increases the throughput of sandwiches per hour versus sequential processing.
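As a rough Java sketch of the same idea (the stage names, queue sizes, and order counts are made up), each stage runs on its own thread and hands work to the next stage through a blocking queue:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SandwichPipeline {

    private static final String DONE = "DONE"; // poison pill that shuts each stage down

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> toVeggies = new ArrayBlockingQueue<>(10);
        BlockingQueue<String> toRegister = new ArrayBlockingQueue<>(10);

        // Stage 2: add veggies, then pass the sandwich down the line
        Thread veggieStage = new Thread(() -> {
            try {
                String order;
                while (!(order = toVeggies.take()).equals(DONE)) {
                    toRegister.put(order + " + veggies");
                }
                toRegister.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 3: ring up the finished sandwich at the register
        Thread registerStage = new Thread(() -> {
            try {
                String order;
                while (!(order = toRegister.take()).equals(DONE)) {
                    System.out.println("Paid for: " + order);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        veggieStage.start();
        registerStage.start();

        // Stage 1: add meat to the bread for each customer and feed the pipeline
        for (int customer = 1; customer <= 4; customer++) {
            toVeggies.put("sandwich " + customer + ": bread + meat");
        }
        toVeggies.put(DONE);

        veggieStage.join();
        registerStage.join();
    }
}

While customer 1 is at the register, customer 2’s veggies are being added and customer 3’s meat is going on the bread, which is where the throughput gain comes from.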

Crypto mining using a Graphics Processing Unit (GPU)
GPUs often have many cores that can be used to mine cryptocurrencies by solving complex mathematical problems. The reason to use a GPU over a CPU is that the CPU is like the CEO of a company: its role is to oversee the big picture of what needs to be computed and to schedule and prioritize tasks and resources as needed. A CPU, in a sense, is optimized for multitasking. A GPU, on the other hand, is optimized for the single, repetitive tasks it is assigned, like video processing, and for raw throughput on one piece of work. GPUs are the low-skilled but numerous laborers of the company, which makes them better suited for the repetitive mathematical computation that goes into crypto-mining.

What are inter-connect networks?
Inter-connect networks are networks between processors and memory, or between computers. They act as a highway that transports data between memory and CPUs during the processing of tasks that happen simultaneously. Inter-connect networks can connect CPUs inside one computer or a group of connected computers. They are useful because some problems have data sets so large that they cannot be computed effectively by one CPU or one memory set. Inter-connect networks allow groups of processors to work together, increasing computing power in a cost- and time-effective manner. Another advantage is that more memory and CPUs can be added as needed.

For groups of computers, connecting them via inter-connect networks becomes increasingly complex, in both the number of connections and the cost. Connecting three or four computers together is a trivial task; however, imagine the connections that would be required to connect a hundred or a thousand computers together. Two approaches to reducing the complexity of this problem are mesh networks and hypercube networks.

Mesh network
If you were to visualize a mesh network, you could draw perpendicular and parallel lines so that they intersect with each other, similar to how a checkerboard looks. Each point of intersection is where a node resides, with connections to up to four other nodes. The benefit of mesh networks is that it is easy to increase the number of nodes inside a network as needed.

Hypercube network
Hypercube networks arrange nodes like the corners of a cube: in the three-dimensional case, each node is connected to three other nodes.

Algorithms for common operations

Quicksort on a hypercube

In Exhibit D, I designed a program that implemented parallel quicksort with Java threads. If I wanted to scale up throughput with additional computers, I could apply the hypercube interconnect network to quicksort, roughly as follows:

  • Split array into subproblems amongst nodes in a hypercube
  • Sort each subproblem local within a node
  • Use the first node to broadcast the pivot integer
  • Partition each list in each node, then swap the left and right arrays across the nodes
  • Repeat the last two steps for each dimension of the hypercube
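Since I don’t have a cluster to run this on, here is a rough single-machine simulation of those steps (the node count, list sizes, and values are made up, and the local sort happens at the end rather than the start to keep the sketch short). Each “node” is just a list, and the swaps are list exchanges:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class HypercubeQuicksortSim {
    public static void main(String[] args) {
        int dims = 3;                 // three-dimensional hypercube: 8 "nodes"
        int nodeCount = 1 << dims;
        Random rnd = new Random(7);

        // Split the data into subproblems, one list per node
        List<List<Integer>> nodes = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(rnd.ints(16, 0, 1000).boxed()
                    .collect(Collectors.toCollection(ArrayList::new)));
        }

        // One round of pivot broadcast + partition swap per hypercube dimension
        for (int dim = dims - 1; dim >= 0; dim--) {
            int groupSize = 1 << (dim + 1);
            for (int group = 0; group < nodeCount; group += groupSize) {
                // The first node of this sub-hypercube "broadcasts" the pivot
                List<Integer> root = nodes.get(group);
                int pivot = root.isEmpty() ? 0 : root.get(root.size() / 2);

                // Partition each node's list, then swap halves with the partner node
                for (int id = group; id < group + groupSize / 2; id++) {
                    int partner = id | (1 << dim);
                    List<Integer> keepLow = new ArrayList<>(), sendHigh = new ArrayList<>();
                    for (int v : nodes.get(id)) {
                        if (v <= pivot) keepLow.add(v); else sendHigh.add(v);
                    }
                    List<Integer> sendLow = new ArrayList<>(), keepHigh = new ArrayList<>();
                    for (int v : nodes.get(partner)) {
                        if (v <= pivot) sendLow.add(v); else keepHigh.add(v);
                    }
                    keepLow.addAll(sendLow);   // low-side node keeps everything <= pivot
                    keepHigh.addAll(sendHigh); // high-side node keeps everything > pivot
                    nodes.set(id, keepLow);
                    nodes.set(partner, keepHigh);
                }
            }
        }

        // Sort each node locally; reading the nodes in order is now globally sorted
        nodes.forEach(Collections::sort);
        System.out.println(nodes);
    }
}

In a real cluster, each list would live on a separate machine and the swaps would be messages over the inter-connect network instead of list copies.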

Parameter study problem
“In so-called embarrassingly parallel problems, a computation consists of a number of tasks that can execute more or less independently, without communication. These problems are usually easy to adapt for parallel execution. An example is a parameter study, in which the same computation must be performed using a range of different input parameters. The parameter values are read from an input file, and the results of the different computations are written to an output file” (Foster). My Node.js project, Exhibit C, and my Java sum problem, Exhibit D, are embarrassingly parallel because the order in which the workers or threads complete does not matter.
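A minimal Java sketch of a parameter study (the parameter range and the computation itself are invented for illustration) would submit each parameter value as an independent task to a thread pool and then gather the outputs:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParameterStudy {

    // The "computation": any function of a single input parameter will do
    static double simulate(double parameter) {
        return Math.sqrt(parameter) * Math.log1p(parameter);
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Each parameter value becomes an independent task; no communication is needed
        List<Future<Double>> results = new ArrayList<>();
        for (double p = 0.0; p < 100.0; p += 1.0) {
            final double parameter = p;
            Callable<Double> task = () -> simulate(parameter);
            results.add(pool.submit(task));
        }

        // Gather the outputs; the order in which the tasks finish does not matter
        for (Future<Double> result : results) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }
}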

My Node.js program downloads multiple videos and subtitles and then uses FFmpeg, a multimedia processing library, to burn the subtitles into the video. This means that the subtitles show up on the bottom of the video and cannot be turned off. The output of the program can be seen below; notice how some tasks complete out of order in the parallel version.

$ node index.js
Getting subtitles…
Number of Cores: 4
Starting…
Processing video 1 of 4
Processing video 2 of 4
Processing video 3 of 4
Processing video 4 of 4
Completed All
Execution time Single Core Processing: 57.16 Seconds
Number of Cores: 4
Processing video 1 of 4
Processing video 2 of 4
Processing video 3 of 4
Processing video 4 of 4
Starting…
Completed: 3
Completed: 1
Completed: 4
Completed: 2
Completed All
Execution time Parallel Processing: 22.24 Seconds

Develop an efficient parallel algorithm to solve a problem

Parallel summation of integer array

Imagine an array of integers. How would one add all the numbers in the array to get the total value of all the integers combined? One approach is to iterate over every number, add it to a result variable, and return the result variable when you reach the end of the array. While this works for small collections of integers, it is not performant for large collections. For example, suppose you want to calculate the total amount of money traded on the stock market: a single process using one core on one computer could not process billions of transactions in a reasonable amount of time.

The parallel approach is to apply a divide-and-conquer algorithm to the array of transactions. Blelloch and Maggs describe it as follows: “A divide-and-conquer algorithm splits the problem to be solved into subproblems that are easier to solve than the original problem, solves the subproblems, and merges the solutions to the subproblems to construct a solution to the original problem” (Blelloch and Maggs 18). You can divide the array into smaller subproblems based on the number of cores inside the computer, compute all subproblems in parallel, and then combine the results of the subproblems to get the sum of the values in the entire array. One step beyond that would be to use multiple computers running the same program on different subproblems of all the stock market transactions that happened that year.

Pseudo Code

Sequential Sum

  • Store the first value of the array in a variable
  • If the next index in the array contains an integer, add it to the variable
  • Repeat until there are no indexes left in the array
  • Return the variable

Parallel Programming Sum

  • Split the number array into sections based on the number of items in the array divided by the number of cores
  • Spin up new threads that iterate over the items in each section and return the sum
  • Combine the sums from each worker into a single value
  • Return that value
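A minimal sketch of that parallel pseudocode with plain Java threads might look like the following. It is illustrative rather than the exact Exhibit D code, and it uses a smaller array than the 100,000,000 integers in the assignment:

public class ParallelSumSketch {
    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        int[] numbers = new java.util.Random(1).ints(10_000_000, 0, 10).toArray();

        long[] partialSums = new long[cores];
        Thread[] workers = new Thread[cores];
        int sectionSize = (numbers.length + cores - 1) / cores; // items per core, rounded up

        for (int i = 0; i < cores; i++) {
            final int worker = i;
            final int start = worker * sectionSize;
            final int end = Math.min(start + sectionSize, numbers.length);
            workers[i] = new Thread(() -> {
                long sum = 0;
                for (int j = start; j < end; j++) {
                    sum += numbers[j];         // sum this worker's section
                }
                partialSums[worker] = sum;     // store the partial result
            });
            workers[i].start();
        }

        long total = 0;
        for (int i = 0; i < cores; i++) {
            workers[i].join();                 // wait for each worker to finish
            total += partialSums[i];           // combine the partial sums
        }
        System.out.println("Parallel total: " + total);
    }
}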

For my homework assignment in Exhibit B, I developed a single-computer program that applies the divide-and-conquer algorithm to the sum problem by splitting it into tasks. My knowledge of Java threads from Exhibit A helped me with this assignment, since both courses were in Java. Exhibit D is the GitHub repo that contains the full code. I wrote a RandomInt class that generates an array of 100,000,000 integers. I start the timer and sum this array using a sequential sum function, which iterates over every integer in the array and returns the sum of all the numbers. I then end the timer; this takes about 90 milliseconds to complete. I repeat a similar process again, but using my parallel class instead.

The parallel version is about 22 times faster on my i7 MacBook Pro. The parallel approach takes the number of cores in my CPU and divides the random integer array into subproblems. I have eight virtual cores, which works out to 12,500,000 integers per core. Each core runs the sequential sum function in parallel, and once all of them are complete, I combine the results of each subproblem into the total value of all integers in the random integer array. You can find the rest of my code on GitHub, but here is a code sample of my array sum in Java.

// Main.java

import pack.RandomInt;
import pack.Sequential;
import pack.Parallel;

public class Main {
    public static void main(String[] args) {

        int cores = Runtime.getRuntime().availableProcessors();
        RandomInt numbers = new RandomInt(100000000); // 100,000,000 random integers

        // Time the sequential sum
        long begin = System.currentTimeMillis();
        Sequential seq = new Sequential(numbers.get());
        long end = System.currentTimeMillis();
        System.out.println("Sequential Total: " + seq.total() + " Completed: " + (end - begin) + " milliseconds");

        // Time the parallel sum, using one task per core
        long pbegin = System.currentTimeMillis();
        Parallel par = new Parallel(cores);
        long pend = System.currentTimeMillis();
        System.out.println("Parallel Total: " + par.total(numbers.get()) + " Completed: " + (pend - pbegin) + " milliseconds");

    }
}

Terminal Output

java Main
Sequential Total: 804629830 Completed: 45 milliseconds
Parallel Total: 804629830 Completed: 2 milliseconds

Exhibit Summaries

Exhibit A
Udemy Certificate of Completion — Java Multithreading, Concurrency & Performance, from April 2019.

Exhibit B
Udemy Certificate of Completion — Multithreading and Parallel Computing in Java, from June 2019.

Exhibit C
Screenshot of a GitHub repo that contains a code example for parallel video processing using FFmpeg and Node.js, from May 2019.
https://github.com/KristofferHebert/node-ffmpeg-video

Exhibit D
Screenshot of a GitHub repo that contains a code example for the sum problem in Java, from May 2019.
https://github.com/KristofferHebert/parallel-algorithms

Exhibit E
Screenshot of Final Grade for CS101 from Saylor Academy, from November 2016
https://github.com/KristofferHebert/parallel-algorithms

Bibliography

Markoff, John. “Computer Wins on ‘Jeopardy!’: Trivial, It’s Not.” The New York Times, 16 Feb. 2011. Web. 9 June 2019. https://www.nytimes.com/2011/02/17/science/17jeopardy-watson.html

Morgan/Reuters, Peter. “Twenty Years on from Deep Blue vs Kasparov: How a Chess Match Started the Big Data Revolution.” The Conversation, 17 May 2017. Web. 9 June 2019. https://theconversation.com/twenty-years-on-from-deep-blue-vs-kasparov-how-a-chess-match-started-the-big-data-revolution-76882

Krastev, Plamen. “Introduction to Parallel Computing.” San Diego State University. https://sci.sdsu.edu/~johnson/phys580/Parallel_computing1.pdf

Zhang, Jun. “Parallel Computing Chapter 7: Performance and Scalability.” University of Kentucky. https://cs.uky.edu/~jzhang/CS621/chapter7.pdf

Blelloch, Guy E., and Bruce M. Maggs. “Parallel Algorithms.” Carnegie Mellon University. https://www.cs.cmu.edu/~guyb/papers/BM04.pdf

Abram, Greg. “Parallel Visualization.” University of Texas at Austin. https://www.cac.cornell.edu/Education/Training/data10/ParallelVisualization.pdf

