Architecting smart applications on top of Apache Spark
Here I would like to share a set of architectural guidelines for building a platform for data scientists and big data engineers that will allow them to expose analytics as services for enterprise solutions. We’ll walk through design principles and review available opensource components.
After reading this paper you should be able to:
- Deliver more value for end users from data science models.
- Architecturally connect the big data stack with other microservices.
- Provide stable and resilient orchestration of Apache Spark programs.
- Plug the cultural gap between big data engineers, data scientists and web developers.
The architecture we outline here works well with different use cases. For example, consider the following:
- Applications for business users and tenants — revenue forecaster, rides simulator, campaign optimization, predictive maintenance optimization and others.
- Applications for consumers — recommendation engines, shopping cart bots and others.
- Applications for data scientists and analysts — hosted Notebook, ETL builder and data cleansing apps.
From a business standpoint, all of these applications drive the adoption of data driven processes and empowers users with realtime data insights. In addition, they force users to interact with data models in a way that provides feedback to data scientists.
Spark as a Service
An architecture has to recognize the distinction between the big data backend and web frameworks. Data scientists are used to working with SQL and Spark abstractions. The web-developer prefers to use use REST and HTTP, as that hides the complexity of big data processing and modeling pipelines.
To bring together the different approaches, what is needed is middleware that will receive requests from a web application, execute a Spark program and return the results. Yet this will not force the developer and data scientist to use an approach they are not comfortable with.
There are following challenges to consider:
- Reliability — a common workflow for executing Apache Spark programs is to have a few jobs, scheduled by cron running, in the cluster and crunching data. A design should take care of managing concurrent requests coming from multiple users and applications.
- Interoperability — a unified API required to run fast ad-hoc, long running and streaming jobs and then consume their results. REST is not enough.
In Apache Spark terminology, you should be able to run different jobs in different Spark Contexts. Then you should manage, isolate and dynamically allocate resources for these contexts and jobs.
SparkContext is expensive in terms of performance. So, in order to provide instant job execution there is the need to maintain a pool of SparkContexts.
Consider running different types of Apache Spark jobs, like streaming, ad-hoc and batch jobs in parallel. Or you might even need to execute jobs for different tenants. In order to guarantee the reliable isolation of Spark Contexts, it is recommended to create them in different JVMs or Docker containers. Master node manages multiple SparkContexts in different JVMs or Docker containers and acts as a single cluster.
When the topic is resource management, the first thing comes to mind is Mesos or YARN. It is very well known how to run Spark in cluster mode. But here we need to manage resources between driver programs as well since they are run concurrently on the same cluster. As a solution our middleware service should be implemented as a Mesos framework itself, so it can allocate resources for Spark programs and make sure they do not die.
An analogy to this approach is the Jenkins Master/Slave setup. There slave instances are spun up dynamically, as Docker containers, and allocated resources from Mesos or other resource managers.
Conceptually, a great data model should expose analytics as a service. Think about the data scientist who built the Google Cloud Prediction, Amazon or Azure ML model. This fronts a reliable backend and yields data in JSON format, which makes it easy to consume by lots of programming languages and tools.
So what options do we have to expose this data model?
- HTTP API
- Messaging API
- Reactive API
- Shared database
The opensource tools Hydrosphere Mist and spark-jobserver both provide the ability to trigger a Spark job with an HTTP request and retrieve the results.
However there are significant differences:
- Mist provides a more friendly abstraction for web developers versus the low level API of the spark-jobserver.
- Spark-jobserver does not provide a clear way to pass parameters needed for a job, e.g., run an analytics job for a set time frame.
With really large big data volume and complex calculations it does not always work well to use a blocking HTTP API even with “fast” 5–10 seconds Spark queries. You cannot have HTTPS requests waiting to be handled. So, it is better to use MQTT, AMQP or any existing messaging platform to deliver the response to the client application.
Asynchronous mode could be simulated by using HTTP callbacks or a polling mechanism. It does not require dependency from messaging systems. Yet that introduces added complexity in both systems. For example, a polling job needs to be added on the client side and failover logic implemented on the server side when the callback fails to kick off.
What about consuming the results of Spark Streaming or intermediate results of a batch modeling job? For example, that might include real-time predictive maintenance alerts, chat messages from a personal assistant bot or predictive ad campaign tracker.
So, an API should support triggering a Streaming job and providing a feedback loop or a way to consume the results. Hydrosphere Mist includes that feature in its roadmap. So, this will allow building reactive applications on top of Apache Spark.
A shared database relates to business intelligence or data warehousing scenarios where Spark jobs prepare data views and then your application acts as a BI tool and simply visualizes prepared reports.
It is out of the scope for this paper, but it‘s worth mentioning that both the Spark backend and client application could use their own storage to cache and index those data sets if needed.
We’ve already mentioned spark-jobserver and Hydrosphere Mist as services that provide a REST API to applications deployed on top of Apache Spark. Spark-jobserver is a much more mature project. But the Hydrosphere Mist product includes the conceptual and architectural advantages listed below and discussed above.
There are other tools too. IBM Kernel (Apache Toree) and Cloudera Livy are mainly focused on the Hosted Notebook model, where they execute a code snippet and send the results back. But they are worth mentioning since Notebook is one of the applications we could build on top of Apache Spark.
Here we give a side-by-side tool comparison. We list features that are important for building interactive and reactive applications on top of Apache Spark. The tools we mention also have many more features suitable for other use cases too.
If you found these architectural points helpful and relevant for your big data platform and if you see potential benefit to your business in building interactive realtime applications, which leverage big data models like this, please look at how to guide at the next blog post.
I welcome comments and suggestions. Please feel free to reach out to me at firstname.lastname@example.org.