MPI Workload Performance on the MapR Data Platform, Part 1

Nicolas A Perez
Feb 12 · 6 min read

In the Big Data world, the MapR Data Platform occupies, without question, an important place thanks to the technological advantages it offers. The ability to run mixed workloads that include continuous data processing with MapR Event Store, batch processing of very large data sets on the scalable and performant MapR-Filesystem, and storage of virtually unlimited documents of any shape in MapR-DB, the NoSQL database, are only a few examples of how MapR has become the technological marvel it is today. However, some questions arise. Is MapR able to run classic High-Performance Computing (HPC) workloads the way traditional HPC systems do? Are these 20-year-old technologies able to keep up with newer tools such as Apache Spark? Can we use MapR to run classic computational libraries in the same way we run Spark?

To answer these questions, we will take concrete examples, run them with both approaches, and compare how they behave.

We will use CPU-intensive examples so that we measure how each tool performs on CPU-bound tasks. We believe this is a fair comparison, since some of the older libraries might not offer storage access capabilities, such as access to HDFS, in the way newer frameworks like Apache Spark do.

We will use two specific examples: one in this post and a second in a follow-up post. The first one is described below.

In this post, we are going to implement the Sieve of Eratosthenes, a classic and highly parallelizable algorithm for calculating prime numbers. One implementation is written in C using MPI, and a similar one uses Apache Spark running on YARN. In both cases, we will measure strictly the time taken to calculate the required primes, ignoring overhead such as the job scheduling that YARN adds.

Then, in a second blog post, we will implement a matrix multiplication algorithm, another classic numerical problem that is CPU intensive, highly distributed, and parallelizable. In this case, we will again use a pure C implementation using MPI and compare it with the default implementation offered by Spark MLlib. Of course, we could implement our own custom multiplication procedure using Spark, but the idea is to compare against what is already there.

Notice that we are using MPI and Spark, and there are reasons for it. MPI is the de facto standard for writing multiprocessor programs in the HPC world. Apache Spark, on the other hand, has grown in the big data space and cannot be ignored when talking about distributed workloads.

Finding Prime Numbers

If there is a problem that Computer Science students enjoy solving, it is finding prime numbers. There are different ways to do so, from very naive solutions to very complex ones. In our example, we will use the Sieve of Eratosthenes, but we will leave out some possible optimizations for the sake of code clarity. Both implementations, the one using MPI and the one using Apache Spark, skip these optimizations and focus on the raw algorithm.

The following image shows how the algorithm works when finding the primes smaller than 10 using 2 parallel processors. The same technique scales to a much larger number of processors.

Now, let's implement this idea using MPI.

Let’s start with some of the supporting functions we need. As we saw in the image above, each processor holds only a piece of the data; in other words, our data is partitioned across the processors participating in the calculation.

The following function is called by each process and initializes the processor’s data.
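A minimal sketch of what such an initialization function could look like, assuming the candidates 2..n-1 are split into contiguous blocks, one per process. The name init_partition and its parameters are illustrative and are not taken from the original code; in an MPI program, rank and size would typically come from MPI_Comm_rank and MPI_Comm_size.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: fills *out with the slice of the candidates 2..n-1
 * owned by process `rank` out of `size` processes (block partitioning).
 * Returns the number of values in the slice; the caller frees *out. */
int init_partition(int rank, int size, int n, int **out)
{
    assert(size > 0 && rank >= 0 && rank < size && n >= 2);

    int total = n - 2;                    /* candidates are 2, 3, ..., n-1 */
    int base  = total / size;             /* minimum slice length */
    int extra = total % size;             /* first `extra` ranks get one more */

    int count = base + (rank < extra ? 1 : 0);
    int first = 2 + rank * base + (rank < extra ? rank : extra);

    /* Allocate at least one element so an empty slice is still a valid pointer. */
    *out = malloc((size_t)(count > 0 ? count : 1) * sizeof **out);
    assert(*out != NULL);

    for (int i = 0; i < count; i++)
        (*out)[i] = first + i;
    return count;
}
```

With n = 10 and 2 processes, rank 0 would own 2, 3, 4, 5 and rank 1 would own 6, 7, 8, 9.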

Next, each processor needs to filter its own data set based on a value k: if a value is divisible by k (and is not k itself), it is not prime and should be eliminated.
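The filtering step could be sketched as follows, assuming each process keeps its surviving values in a plain int array and compacts it in place. The name filter_multiples is hypothetical; note that k itself is kept, since it is prime.

```c
#include <assert.h>

/* Hypothetical helper: removes from data[0..count) every multiple of k
 * other than k itself, compacting the survivors in place.
 * Returns the new number of values. */
int filter_multiples(int *data, int count, int k)
{
    assert(k >= 2);

    int kept = 0;
    for (int i = 0; i < count; i++) {
        /* Keep k itself and anything k does not divide. */
        if (data[i] == k || data[i] % k != 0)
            data[kept++] = data[i];
    }
    return kept;
}
```

Because the survivors are copied forward in order, the array stays sorted, which keeps the later selection of the next k simple.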

After this, the next k value must be selected globally. Each processor selects its own local k, and then all processors agree on a global k value to use in the next iteration. The following function selects the local k value.
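One possible shape for the local selection, assuming the local array is kept in ascending order. The name select_local_k and the use of INT_MAX as a "no candidate" sentinel are assumptions made for this sketch; the sentinel lets a process with nothing left still take part in a global minimum.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: returns the smallest surviving local value that is
 * strictly greater than the current k, or INT_MAX when this process has no
 * such value. */
int select_local_k(const int *data, int count, int k)
{
    assert(count >= 0);

    for (int i = 0; i < count; i++) {     /* data is in ascending order */
        if (data[i] > k)
            return data[i];
    }
    return INT_MAX;                        /* nothing left to propose */
}
```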

Now, our application becomes the following.

Notice that MPI_Allreduce is used to agree on the next k value, which is the global minimum of the local k values.
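To make the control flow concrete without requiring an MPI installation, the sketch below simulates the ranks inside a single process. It is not the MPI program itself: the plain minimum taken across the simulated ranks stands in for a call such as MPI_Allreduce(&local_k, &global_k, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD). The names simulate_sieve, Rank, MAX_RANKS, and MAX_LOCAL are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define MAX_RANKS 8
#define MAX_LOCAL 64

/* One simulated process: its slice of the candidates and how many survive. */
typedef struct {
    int data[MAX_LOCAL];
    int count;
} Rank;

/* Runs the sieve over the candidates 2..n-1 split across `size` simulated
 * ranks. Writes the primes to `primes` and returns how many were found. */
int simulate_sieve(int n, int size, int *primes)
{
    assert(size >= 1 && size <= MAX_RANKS);
    assert(n >= 2 && (n - 2 + size - 1) / size <= MAX_LOCAL);

    Rank ranks[MAX_RANKS];
    int total = n - 2, base = total / size, extra = total % size;

    /* Step 1: every rank initializes its own contiguous block of the data. */
    int next = 2;
    for (int r = 0; r < size; r++) {
        ranks[r].count = base + (r < extra ? 1 : 0);
        for (int i = 0; i < ranks[r].count; i++)
            ranks[r].data[i] = next++;
    }

    int k = 2;
    while (k != INT_MAX) {
        int global_k = INT_MAX;
        for (int r = 0; r < size; r++) {
            /* Step 2: drop local multiples of k, keeping k itself. */
            int kept = 0;
            for (int i = 0; i < ranks[r].count; i++) {
                int v = ranks[r].data[i];
                if (v == k || v % k != 0)
                    ranks[r].data[kept++] = v;
            }
            ranks[r].count = kept;

            /* Step 3: this rank's candidate for the next k. */
            int local_k = INT_MAX;
            for (int i = 0; i < kept; i++) {
                if (ranks[r].data[i] > k) {
                    local_k = ranks[r].data[i];
                    break;
                }
            }

            /* Step 4: stand-in for the MPI_MIN reduction across ranks. */
            if (local_k < global_k)
                global_k = local_k;
        }
        k = global_k;   /* INT_MAX means no rank had a candidate left */
    }

    /* Whatever survives on any rank is prime. */
    int found = 0;
    for (int r = 0; r < size; r++)
        for (int i = 0; i < ranks[r].count; i++)
            primes[found++] = ranks[r].data[i];
    return found;
}
```

For the setting used earlier (primes smaller than 10 on 2 processors), the agreed k values are 2, 3, 5, and 7, and the surviving values across both ranks are 2, 3, 5, and 7. Like the rest of this post, the sketch deliberately omits optimizations such as stopping once k squared exceeds n.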

The entire MPI code is the following.

We can try running this application on a 6-node MapR cluster:

mpirun -np 6 --oversubscribe sieves_of_eratosthenes 100000 0
elapsed time: 5.550563

As we can see, it calculates all prime numbers smaller than 100000 in around 5.5 seconds.

Let’s see what happens if we run the same application on the same 6-node MapR cluster, but with a larger number of processes (we can do this because each node has multiple CPUs).

mpirun -np 24 --oversubscribe sieves_of_eratosthenes 1000000 0
elapsed time: 7.580463

As we can see, the application finishes in about 7.6 seconds with 24 processes and an input of 1000000, an impressive time considering the number of calculations being executed.

Now, the same application can be implemented in Spark. I will try to keep the code as simple as possible while going over the different parts of it.

The run function creates the data, which is partitioned and distributed across the cluster. Then it calls the clean function.

The clean function filters the data based on the k value, as we did before in the MPI implementation. Then it calls nextK to find the next k value.

The nextK function just selects the next possible value for k.

It is important to notice that here we are calling .first, which is semantically equivalent to the synchronization step that we implemented before using MPI_Allreduce.

The entire code looks like the next snippet.

Notice that there is a piece where we write our RDD to storage and reload it. This is necessary because the recursion step causes an issue with the DAG lineage. There is no simple way around it, but our approach solves the problem without major complications.

As we can see, Spark gives us higher levels of abstraction, which ultimately make our code simpler. But how fast is this code compared to our MPI code when executed on the same MapR cluster?

When running on the same MapR cluster, the Spark application is about 10x slower. Part of that is the reading and writing of data to and from HDFS, but this accounts for only a small part of the processing time.

Notes on MapR and MPI

We mentioned before that our tests focus on CPU-bound operations, since MPI has some disadvantages when accessing HDFS data. However, since we are talking about MapR specifically, we could use the fully POSIX-compliant client instead of the HDFS interface. This means that, using pure C and MPI, we could access the distributed file system exposed by MapR without any problems. In a follow-up post, we will look at how good MPI is at accessing large-scale data sets stored in MapR using the POSIX client.


If you thought that MPI was something from the past, you might want to reconsider. Even with lower-level abstractions, it is more than capable, especially when running on modern platforms like MapR. That, however, raises a big question about classical HPC systems. The future is unknown, but MPI is a very well-built technology, and I will be surprised if it does not become part of the big data ecosystem soon enough. On the other hand, Apache Spark is the other monster in the room. With its higher abstractions, it is perfect for almost everything you can think of, so it makes a lot of sense to master such a tool even though it might sometimes perform worse than MPI.

The tradeoff between ease of use and high performance has been there for years and cannot be ignored. Being aware of it helps us decide in every situation.

