Utilizing Spark Structured Streaming as a Middleware to Connect Microservices

An Apache Spark Application In Microservices Ecosystems

Hossein Bakhtiari · Published in The Startup · Apr 11, 2020 · 6 min read


Articulating a problem reflects our knowledge of its domain. In the software and data world, we almost always encounter problems that have existed for years and to which teams around the globe have contributed, each producing a unique approach. In this article, I want to show you one of these problems and two possible solutions: one using pure Java, and the other, more robust one, leveraging Scala on top of Apache Spark.

Problem Definition

Connecting different microservices so that the output of some becomes the input of others is a common need in a microservices ecosystem. There are several approaches to this scenario, depending on functional and nonfunctional requirements and architectural constraints.

Suppose we have microservices that produce messages on an Apache Kafka cluster, writing them as key-value pairs to a specific topic. We call these producing microservices sources. On the other hand, there are microservices that we want to update based on some calculations over the corresponding messages. We call these destinations. We are going to design and implement a middleware microservice that serves as a bridge between the sources and the destinations. Obviously, one approach is to change the sources so that they compute the request and call the destinations themselves. But if we make the sources do that, our ecosystem will end up in complete chaos; it is an antipattern. Besides, transforming data and calling destinations is not a source's responsibility (Separation of Concerns). Consequently, building another component to act as a bridge between them is highly recommended.

Here is the scenario we are supposed to develop:

  1. Reading messages generated by sources from the topic
  2. Classifying messages based on their keys
  3. Extracting their JSON messages
  4. Calculating the required request fields to update destinations
  5. Calling destination microservices REST APIs
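
To make the flow concrete, here is a minimal sketch of the kind of message a source might publish. The broker address, topic name, key, and JSON fields are placeholders, not part of the original design:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object SourceProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")               // assumed broker address
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    // The key identifies the message class; the value is the JSON payload.
    val record = new ProducerRecord[String, String](
      "source-events",                                         // assumed topic name
      "order-service",                                         // key used later for classification
      """{"orderId": 42, "amount": 19.99}"""                   // hypothetical payload
    )
    producer.send(record)
    producer.close()
  }
}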

High-Level Architecture

The high-level overview of the proposed solution is shown below:

High-Level Architecture of the Bridge (Processing Unit) and Source and Destination Microservices

Now the Processing Unit needs to be designed and developed.

Solution 1: Multiple Multithreaded Java Apps

High-Level Architecture of the Bridge Using Multiple Java Apps

This solution builds on what we and our technical teams have known for years: the Java ExecutorService! The idea is that it is simple to develop and even easier to deploy. The only thing we need to do is implement our own version of the Runnable interface to connect to the Kafka topics we are interested in, and then submit our threads to an ExecutorService under the same Kafka consumer group. (We need each message to be processed just once, so if one thread consumes a message, it must not be delivered to the other threads.)
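
Here is a minimal sketch of one such worker, written in Scala for consistency with the rest of the article but calling the same Java APIs (KafkaConsumer, ExecutorService). The broker address, group id, topic, and thread count are placeholders:

import java.time.Duration
import java.util.Properties
import java.util.concurrent.Executors
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.JavaConverters._

// One worker; real code would add JSON parsing, the REST call to the
// destination, error handling, and graceful shutdown.
class BridgeWorker(topic: String) extends Runnable {
  override def run(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")               // assumed broker address
    props.put("group.id", "bridge-consumer-group")             // same group => each message goes to one thread
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList(topic))
    while (!Thread.currentThread().isInterrupted) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.asScala.foreach { record =>
        // classify by key, transform the JSON, call the destination REST API ...
        println(s"key=${record.key()} value=${record.value()}")
      }
    }
    consumer.close()
  }
}

object MultithreadedBridge {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(4)                 // assumed thread count
    (1 to 4).foreach(_ => pool.submit(new BridgeWorker("source-events")))
  }
}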

The Architecture of the Multiple MultiThreaded Java Apps Bridge

After the development, the only thing we need is to dockerize the solution and deploy it on some VMs manually or just submit it to our Kubernetes cluster. It’s a piece of cake, isn’t it?

So do we have a unified cluster here? Is the solution fault-tolerant? How do we check the health of our Java applications? What if some threads get stuck for no obvious reason?

With this solution, we cannot guarantee the resilience of the bridge, because of all the failure cases we know may happen, especially in the long run.

Solution 2: An Apache Spark Structured Streaming Application

“Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.” — Spark Structured Streaming docs

Here is our beloved Spark! Since version 2.2, where it was marked production-ready, Spark has offered Structured Streaming, which integrates seamlessly with Apache Kafka as a data source. The trick here is to see the messages as rows of data. The proposed solution looks like this:

High-Level Architecture of the Bridge Using Spark Structured Streaming

As we are going to use the Append output mode, the size of the input and of the intermediate table is not a concern.

Note that Structured Streaming does not materialize the entire table. It reads the latest available data from the streaming data source, processes it incrementally to update the result, and then discards the source data. It only keeps around the minimal intermediate state data as required to update the result. — Spark Structured Streaming docs
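
To make "messages as rows" concrete, here is a minimal sketch of the read side (it requires the spark-sql-kafka-0-10 package). The broker address, topic, and payload schema are the same placeholders used above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, LongType, StructType}

object StructuredStreamingBridge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-bridge").getOrCreate()

    // Hypothetical payload schema; adjust to the real source messages.
    val payloadSchema = new StructType()
      .add("orderId", LongType)
      .add("amount", DoubleType)

    // Each Kafka message becomes a row with binary key/value columns.
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")         // assumed broker address
      .option("subscribe", "source-events")                    // assumed topic name
      .load()

    // Cast and parse the JSON value into typed columns.
    val parsed = messages
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS json")
      .select(col("key"), from_json(col("json"), payloadSchema).alias("payload"))
      .select(col("key"), col("payload.orderId"), col("payload.amount"))
    // `parsed` feeds the custom sink shown below (writeStream.foreach(...)).
  }
}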

As we need to implement our customized flow, we have to skip the built-in data sources as the output sink and develop our own data destination. Here the handy generic abstract class ForeachWriter comes in, of which we need to implement three methods: open, process, and close.

abstract class ForeachWriter[T] extends Serializable

“The abstract class for writing custom logic to process data generated by a query. This is often used to write the output of a streaming query to arbitrary storage systems. Any implementation of this base class will be used by Spark in the following way.

A single instance of this class is responsible for all the data generated by a single task in a query. In other words, one instance is responsible for processing one partition of the data generated in a distributed manner.

Any implementation of this class must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(…) method has been called, which signifies that the task is ready to generate data.” — Spark Structured Streaming docs
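
As a rough illustration (not the article's exact code), a ForeachWriter that posts each row to a destination's REST API could look like the sketch below. The endpoint, field names, and the bare-bones HttpURLConnection client are assumptions:

import java.net.{HttpURLConnection, URL}
import org.apache.spark.sql.{ForeachWriter, Row}

// Minimal sketch: one HTTP POST per row. Real code would add retries,
// timeouts, authentication, and status-code handling.
class RestApiWriter(endpoint: String) extends ForeachWriter[Row] {

  override def open(partitionId: Long, epochId: Long): Boolean = {
    // Initialize per-task resources here (e.g. an HTTP client or connection pool).
    true // returning true means this partition/epoch should be processed
  }

  override def process(row: Row): Unit = {
    val body = s"""{"orderId": ${row.getAs[Long]("orderId")}, "amount": ${row.getAs[Double]("amount")}}"""
    val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes("UTF-8"))
    conn.getResponseCode // fire the request; real code would check the status
    conn.disconnect()
  }

  override def close(errorOrNull: Throwable): Unit = {
    // Release anything opened in open(); errorOrNull is non-null on failure.
  }
}

Wiring it into the stream would then be something like parsed.writeStream.outputMode("append").foreach(new RestApiWriter("http://destination-service/api/update")).start(), where the URL is again just a placeholder.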

One thing to keep in mind: as of now, Spark 2.4.5 cannot guarantee exactly-once delivery with ForeachWriter, because deduplication cannot be achieved with the (partitionId, epochId) pair alone. (To get this guarantee we can use foreachBatch instead. The reason we are using ForeachWriter here is that I have assumed there are cases where you cannot run a batch operation and must process rows one by one, e.g. when a REST API has to be called per row.) So if an at-least-once guarantee is not a fit for your situation, you need to handle deduplication yourself, for example with a logging mechanism.
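
For completeness, here is a hedged sketch of the foreachBatch alternative. The in-memory set below is purely illustrative bookkeeping standing in for a durable store (a database table, a file on shared storage, etc.):

import org.apache.spark.sql.DataFrame
import scala.collection.mutable

object ExactlyOnceSketch {
  // Illustrative only: batch ids already delivered. In production this must
  // survive driver restarts, so keep it in a durable store instead.
  private val processedBatches = mutable.Set.empty[Long]

  def handleBatch(batch: DataFrame, batchId: Long): Unit = {
    // batchId is stable across retries, so recording it lets a retried batch be skipped.
    if (!processedBatches.contains(batchId)) {
      batch.collect().foreach(row => println(s"batch=$batchId row=$row")) // deliver the batch here
      processedBatches += batchId
    }
  }
}

The writer side would then be parsed.writeStream.outputMode("append").foreachBatch(ExactlyOnceSketch.handleBatch _).start(). Note that collect() is only acceptable for small illustrative batches; a real job would iterate per partition or write the batch with a built-in sink.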

High Availability and High TPS (Transactions per Second)

There are plenty of articles about Spark high availability and TPS on the web. Spark is fast and tunable thanks to Project Tungsten, the SQL Catalyst optimizer (especially if you can transform and enrich your data in a batch fashion using the Spark SQL engine), memory management and tuning (where you decide how to dedicate memory fractions to your data and computations), garbage collector tuning (sometimes people decide to take responsibility for the garbage themselves and get rid of the garbage collector :D), and so on. You may want to take a look at the tuning and optimization tips published by its creators.
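
As a hedged illustration of the knobs mentioned above (the values are arbitrary placeholders, not recommendations, and in production they are usually passed via spark-submit --conf rather than hard-coded):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("kafka-bridge")
  .config("spark.memory.fraction", "0.6")                    // heap share for execution + storage
  .config("spark.memory.storageFraction", "0.5")             // portion of the above protected from eviction
  .config("spark.sql.shuffle.partitions", "8")               // the default of 200 is often too high for small streams
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // example of GC tuning
  .getOrCreate()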
