# Parallel Matrix Multiplication [C][Parallel Processing]

Multiplying matrix is one of the tedious things that we have done in schools. As the dimensions of a matrix grows, the time taken to complete the calculation will also increase. Even for computers, the problem is there.

In this article, we will look into methods that could optimize matrix multiplication in several ways. At the end we are going to analyze the performance of Traditional Matrix Multiplication, Matrix Multiplication Using Parallel For Loops and Optimized Matrix Multiplication Using Parallel For Loops. We are going to run this matrix multiplication program for squared matrices in dimensions from 200 to 2ooo with a step of 200.

You might need following requiremnts to continue.

1. C Programming Language
2. OpenMP

Let’s first look at the basic mathematic behind multiplying matrices.

## Matrix Multiplication

Let’s consider arbitrary matrices A and B. Since the matrices are square matrices n = m = p.

So, the resultant matrix AB can be obtained like this,

## Generate Random Square Matrix

Let’s get into implementation by creating random matrices for multiplication. Here we are using malloc function to allocate memory dynamically at heap. Because when it comes to testing we have to deal with matrices with different dimensions.

`typedef double TYPE;#define MAX_VAL 1000#define MIN_VAL 1TYPE** randomSquareMatrix(int dimension){	/*		Generate 2 dimensional random TYPE matrix.	*/	TYPE** matrix = malloc(dimension * sizeof(TYPE*));	for(int i=0; i<dimension; i++){		matrix[i] = malloc(dimension * sizeof(TYPE));	}	//Random seed	srandom(time(0)+clock()+random());	#pragma omp parallel for	for(int i=0; i<dimension; i++){		for(int j=0; j<dimension; j++){			matrix[i][j] = rand() % MAX_VAL + MIN_VAL;		}	}	return matrix;}`

Here we have defined the data type as double which can be changed according to the use case. The #pragma omp parallel for statement will do the loop parallelization which we can initialize the matrix more efficiently.

We also need a square matrix with zero values to store the answer. It can be implemented in same manner as before by initializing with zero rather than a random value.

## Traditional Matrix Multiplication

Without considering much about the performance, the direct implementation of the matrix multiplication is given below. Operations will occur in sequential manner for each element at resultant matrix. Here matrixA and matrixB are input matrices where matrixC is the resultant matrix. So, we have to pass the resultant matrix into the function as a reference.

`double sequentialMultiply(TYPE** matrixA, TYPE** matrixB, TYPE** matrixC, int dimension){	struct timeval t0, t1;	gettimeofday(&t0, 0);	for(int i=0; i<dimension; i++){		for(int j=0; j<dimension; j++){			for(int k=0; k<dimension; k++){				matrixC[i][j] += matrixA[i][k] * matrixB[k][j];			}		}	}	gettimeofday(&t1, 0);	double elapsed = (t1.tv_sec-t0.tv_sec) * 1.0f + (t1.tv_usec - t0.tv_usec) / 1000000.0f;	return elapsed;}`

In this function it will return the time taken to complete the process. Next we will look into how we use parallel for loops to do this.

## Matrix Multiplication Using Parallel For Loops

When you are going implement loop parallelization in your algorithm, you can use a library like OpenMP to make the hardwork easy or you can use your own implementation with threads where you have to handle load balancing, race conditions etc. For this tutorial I am going to stick with the OpenMP library. We just need to add several lines to make this thing parallel.

`double parallelMultiply(TYPE** matrixA, TYPE** matrixB, TYPE** matrixC, int dimension){	struct timeval t0, t1;	gettimeofday(&t0, 0);	#pragma omp parallel for	for(int i=0; i<dimension; i++){		for(int j=0; j<dimension; j++){			for(int k=0; k<dimension; k++){				matrixC[i][j] += matrixA[i][k] * matrixB[k][j];			}		}	}	gettimeofday(&t1, 0);	double elapsed = (t1.tv_sec-t0.tv_sec) * 1.0f + (t1.tv_usec - t0.tv_usec) / 1000000.0f;	return elapsed;}`

Here we have used #pragma omp parallel to parallelize the outermost for loop. Using this statement it delegates portions of the for loop for different threads. Is this an efficient implementation? Yes, but we cannot clean our hands by just using this statement because there are facts that we have not considered for further optimizations.

Next we will look, how we could improve our solution further.

## Optimized Matrix Multiplication Using Parallel For Loops

What are the underlying problems in above implementation? We have not done any memory optimizations here. Without optimizing memory accesses processor could not execute instructions on data then and there.

Since, our matrices are stored in heap, it is not easy to access them as they stored in the heap. It is better to bring those data from heap to stack before start the multiplication process. So, we need to set containers initially for those data.

`TYPE flatA[MAX_DIM];TYPE flatB[MAX_DIM];`

What is the purpose of using one dimensional arrays to store to 2D matrices? Because, although we perceives a square matrix as 2D dimensional thing, computer memory is much more sequential. It is better at sequential access. When it comes to caching, it prefers related data together. In some caching algorithms, it believes that the data required for the next operation will be available in the next memory location. Then it can easily load those data to cache before CPU requests those data.

Steps of optimized matrix multiplication implementation is given below.

1. Put common calculation at one place

Most of the time we do not consider small calculations that redundant over the program where performance is not required but clarity is.

iOff and jOff calculations are redundant calculations for iteration of k. So, we made these calculations then and there and access where it required

2. Cache friendly algorithm implementation

We all know that memory has a linear arrangement. So, every N-dimensional array ordered sequentially inside the memory. In this example we can convert 2 dimensional input matrices into row major and column major 1 dimensional matrices. Why we need to use the both row major and column major. Simply to minimize the jumps between memory locations which have all the related elements together in the memory. This will also reduce the cache misses.

Eg: matrixA * matrixB

We can convert matirxA into row major and matrixB into column major arrangement. The operation also has parallelized using parallel for loops.

3. Using Stack Vs Heap Memory efficiently

It is fast access stack rather than heap memory. But stack has limited memory. We have stored the large input memories in heap memory. For efficient intermediate calculations we have used the stack with predefined memory allocations.

`double optimizedParallelMultiply(TYPE** matrixA, TYPE** matrixB, TYPE** matrixC, int dimension){	int i, j, k, iOff, jOff;	TYPE tot;	struct timeval t0, t1;	gettimeofday(&t0, 0);	convert(matrixA, matrixB, dimension);	#pragma omp parallel shared(matrixC) private(i, j, k, iOff, jOff, tot) num_threads(40)	{		#pragma omp for schedule(static)		for(i=0; i<dimension; i++){			iOff = i * dimension;			for(j=0; j<dimension; j++){				jOff = j * dimension;				tot = 0;				for(k=0; k<dimension; k++){					tot += flatA[iOff + k] * flatB[jOff + k];				}				matrixC[i][j] = tot;			}		}	}	gettimeofday(&t1, 0);	double elapsed = (t1.tv_sec-t0.tv_sec) * 1.0f + (t1.tv_usec - t0.tv_usec) / 1000000.0f;	return elapsed;}`

Here we have launched 40 threads to do the multiplication process. Internally we have divided the workload in static manner assuming that each multiplication instruction would take same amount of time. Since we are dealing with dimensions of 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800 and 2000, workload can be divided equally among threads.

In omp we have explicitly declared that matrixC as shared resource to avoid race conditions.

## Analyze Performance

Following table contains time taken for each dimension at each approach.

Following graph will give you a better idea of the increment of speedup at each point.

## Complete Code with Test Cases

Complete implementation of the given scenario is available in following Github repository. 