MPI, Big Data and Microservices: 3 Takes on Distributed Computing

Daniel Blazevski
Dec 6, 2017

Over the past 4 years, I have developed and seen a variety of software in which the code is executed on more than one machine. This post gives a closer look at a few different approaches, with a bias towards my own path in the domain.

I will go over the benefits of doing large-scale numerical simulations using MPI — a predecessor to the MapReduce framework — and discuss distributed computing in the contexts of building data pipelines and microservices.

I hope this post will be useful for those who are new to distributed computing, as well as for those who are familiar with one aspect of it and want to learn more. It should be especially helpful for academics who know MPI and want to learn what distributed systems work looks like in industry.

Before there was MapReduce: Granular Control Over Data in Large Scale Simulations using MPI

In government national labs, universities, and energy and defense corporations, a common way to split up a task that takes too long to run on a single machine is to distribute the computation using MPI. There are other, higher-level ways to divide a simulation across multiple processes, but when one wants large-scale performance and fine-grained control over data, MPI is still the gold standard.

The relatively new language Julia has been gaining popularity over the years and has a neat, more RPC-based framework for parallel computing. It will be interesting to see how Julia affects the extent to which MPI continues to be used in the future. Jonathan Dursi has a few interesting opinions on this.

But what exactly is specific to MPI, and why is it still used so widely in this particular domain? The reason it is used heavily in this niche can be summarized as follows:

Pro: It ALLOWS you to manage inter-process communication by specifying which data/variables are in which processes at specified times in the computation.

Con: It FORCES you to manage inter-process communication by specifying which data/variables are in which processes at specified times in the computation.

An example is when one wants to trace the motion of particles in, say, a nuclear reactor.

MPI has SEND and RECEIVE methods that allow one process, tracing a particle in its spatial sub-grid, to send the particle’s info to another process, which continues tracing the particle.

Managing such inter-process communication is really not what big data tools like Spark or MapReduce were built for. Nor is it ideal for HTTP frameworks that send and receive JSON blobs of data telling clients and servers how to process incoming requests. To put it more concisely,

MPI allows, with simple SEND and RECV abstractions, for one process to update the state of a variable on another process.
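To make this concrete, here is a minimal sketch using mpi4py, the Python binding for MPI (my choice of binding for illustration; the six-number particle state is also made up). Rank 0 traces a particle until it leaves its sub-grid and then hands the state off to rank 1:

```python
# Minimal mpi4py sketch: rank 0 hands a particle's state off to rank 1.
# The six-element state vector (position + velocity) is a hypothetical layout.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Particle state: position (x, y, z) and velocity (vx, vy, vz)
    particle = np.array([0.9, 0.5, 0.5, 1.0, 0.0, 0.0])
    # ... trace the particle until it crosses into rank 1's sub-grid ...
    comm.Send(particle, dest=1, tag=42)    # explicit hand-off to rank 1
elif rank == 1:
    particle = np.empty(6)
    comm.Recv(particle, source=0, tag=42)  # blocks until the state arrives
    # ... continue tracing the particle in this sub-grid ...
```

Launched with something like `mpirun -n 2 python trace_particles.py`, every rank runs the same program but holds only its own piece of the data, and every hand-off between ranks is spelled out explicitly in the code.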

Abstracting Inter-process Communication for Aggregations and Transformations on Collections: Big Data

Tools like Spark, Dataflow, Flink, etc. have well-thought-out abstractions that allow one to do aggregations and transformations (ETL) on data sets. In many business use cases, the logic is similar to pulling out data via a SQL query, so I put tools like AWS Redshift and Google’s BigQuery in this category too. The key feature of these tools is:

Applications to process data can be written without even having to know the data resides on multiple machines.

I deliberately chose to put Redshift and BigQuery on this list to make the case even more clear. Both are engines to execute SQL on data, where both the data and the processing are distributed. In the case of BigQuery, I generally do not even know how many machines my data lies on, nor how many machines are used to execute a query.

Which is GREAT! The business moves fast when each team designing a data warehouse doesn’t have to think about how many physical resources to allocate for storing and analytically processing their data.

Whether in pure SQL or using something like Spark’s Dataframe API, there is no managing communication between processes.
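As a sketch of what this looks like in practice, here is a small PySpark example (the dataset path and column names are hypothetical) that filters and aggregates a table which may be spread over many machines:

```python
# Minimal PySpark sketch; the input path and column names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

# The DataFrame may be partitioned across many machines; the code doesn't care.
orders = spark.read.parquet("s3://my-bucket/orders/")

daily_revenue = (
    orders
    .where(F.col("status") == "completed")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.parquet("s3://my-bucket/daily_revenue/")
```

The groupBy implies a shuffle of data between machines, but Spark plans and executes that shuffle itself; nothing in the application code addresses a particular process or node.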

The primary concern when writing applications is (or should be): what is the right set of operations for my business logic? After that, the biggest concern is whether those operations are efficient enough for the scale of the data, or whether we can tweak them so that others have access to the data in a timely fashion.

The main focus, in addition to getting the transformations correct, is using the right file formats, scheduling jobs effectively, allowing for efficient re-runs of tasks when things go wrong, and tweaking jobs to use distributed operations optimally so that the data is available within a reasonable time frame.

Making and Processing Requests Efficiently: Microservices

A common paradigm for building out large software systems is to build components as microservices. A typical implementation uses HTTP to make and process requests.

There are two sides to building performant services, namely (1) being able to simultaneously process multiple incoming requests and (2) processing each incoming request efficiently.

For (1) we run the exact same code on several machines and those machines never talk to each other. This is purely meant to spread out the load of processing incoming requests.

For (2), multithreading comes in. Abstractions like Futures in Java and goroutines in Go make this relatively easy. When processing an incoming request, a service will likely have to call other services. Multithreading allows, for example, one server processing a request to call two separate services simultaneously.
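As a rough sketch of (2), here is what calling two downstream services concurrently might look like using Python’s concurrent.futures (analogous in spirit to Java’s Futures or Go’s goroutines; the service names and URLs are made up):

```python
# Minimal sketch of concurrent downstream calls while handling one request.
# The service URLs are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def call_service(url: str) -> bytes:
    """Call a downstream service over HTTP and return the raw response body."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.read()

def handle_request() -> tuple[bytes, bytes]:
    # While handling one incoming request, call two downstream services
    # at the same time instead of one after the other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        user_future = pool.submit(call_service, "http://user-service/profile")
        reco_future = pool.submit(call_service, "http://reco-service/items")
        return user_future.result(), reco_future.result()
```

Because both calls are in flight at the same time, the latency of handling the request is roughly that of the slower downstream call rather than the sum of the two.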

To summarize

The application developer has to manage communication between services, though the nodes within a service typically do not communicate with each other to execute business logic.

For example, two requests may come in and be processed on different nodes, with each node simultaneously making requests to another service while handling its own request.

Concluding remarks

I hope this sheds light on at least a few of the different ways in which the world does distributed computing. It took me a while to fully understand the different parts of this landscape, and now that I better understand the differences between these modes of distributed computing, I thought I’d share them with you.
