Spark at Onera: A Hybrid Cloud Approach

Abhinav Sarje
Engineering @ Onera
Sep 19, 2019

The Best of EMR and Dataproc

Apache Spark is useful for processing large workloads at scale, and is one of the most widely used parallel processing frameworks in the industry. Cloud providers — including both Amazon Web Services (AWS) and Google Cloud Platform (GCP) — offer Spark as a hosted service, through their Elastic MapReduce (EMR) and Dataproc services, respectively.

In this post, we discuss the advantages and disadvantages of these two hosted services (per our processing needs at Onera), as well as our design to utilize them both simultaneously. This post is not about how to implement Spark applications or how to use each of these cloud hosted services, but rather about the hybrid cloud approach we have adopted at Onera.

Comparing the Hosted Spark Services

AWS EMR has been around much longer than GCP Dataproc and is viewed as the more mature of the two platforms. Although both offer hosted Spark services, there are significant differences between them. We analyzed these differences to identify which aspects favor one platform over the other and how they influence the design decisions behind the architecture deployed at Onera.

Native support for asynchronous job execution

One of the major limiting factors of AWS EMR is that, surprisingly, it does not natively support asynchronous job execution through its default interface for adding a "Spark Step," so available resources are not utilized to their full potential: multiple Spark jobs submitted to the EMR queue as Spark Steps are processed sequentially, one after the other. To enable concurrent execution of multiple Spark jobs on EMR, we implemented a workaround by explicitly setting the Spark-on-YARN configuration parameter spark.yarn.submit.waitAppCompletion to false, which allows jobs to be submitted and executed asynchronously in a non-blocking fashion. GCP Dataproc, on the other hand, supports this out of the box, so no workaround is needed.
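As a sketch of the workaround, an EMR Spark Step can be built so that spark-submit returns as soon as the application is handed to YARN. The helper below is illustrative (the jar path, class name, and cluster ID are placeholders, not Onera's actual values); the resulting step dict is what boto3's EMR client expects.

```python
def build_async_spark_step(name, jar_uri, main_class):
    """Build an EMR 'Spark Step' that returns as soon as the job is
    submitted to YARN, instead of blocking until the job completes."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                # The workaround: do not wait for the application to finish,
                # so several steps can run concurrently on the cluster.
                "--conf", "spark.yarn.submit.waitAppCompletion=false",
                "--class", main_class,
                jar_uri,
            ],
        },
    }

step = build_async_spark_step(
    "my-job", "s3://my-bucket/jars/my-job.jar", "com.example.Main")
# Submitted via boto3, e.g. (requires AWS credentials and a real cluster ID):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```

With the flag set to false, each add-steps call returns immediately and YARN schedules the applications concurrently, subject to available cluster resources.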

Documentation & community support

EMR has the advantage of being significantly more mature than Dataproc, and consequently offers richer documentation and community support.

Cluster spin-up time

Here, Dataproc is the clear winner. EMR has a typical cluster spin-up time of around ten minutes, while Dataproc requires just two. We also uncovered additional issues on the EMR side: the required Spark and YARN services do not automatically restart after a cluster reboot, leaving the cluster unusable unless these services are restarted through additional scripting. Dataproc is more reliable in this respect, and a cluster reboot does not render the cluster unusable.

Dollar cost

For the workloads at Onera, we found EMR to be more expensive than Dataproc. A detailed cost analysis of executing Spark on the two platforms will follow in a separate post.

Since we utilize AWS S3 for data storage, moving data between S3 and Dataproc incurs additional cost; however, even with this overhead, we still found Dataproc to be cheaper overall.

Stability

Given the above findings, we would ideally favor Dataproc over EMR, primarily for its easier cluster management, lower spin-up time, and lower dollar cost. However, Dataproc proved less stable: we have experienced higher downtime on Dataproc than on EMR.

Based on these experiences, we decided to adopt a hybrid approach where we utilize both platforms and develop a seamless abstraction over them in order to easily utilize either of the two services as the need arises. This also matches Onera’s philosophy of implementing its architecture as a modular system with interactions among microservices, and being flexible in configuration.

A Hybrid Cloud Approach

In order to develop a seamless, platform-agnostic Spark service, we need to be able to submit Spark jobs to either platform, EMR or Dataproc. AWS has been Onera's primary cloud platform, in that our primary pipelines execute on AWS; incorporating GCP services into our architecture requires secure, authenticated communication between the two platforms' VPCs to enable seamless resource access and sharing among our microservices.

Using long-lived credentials to provide resource access is the simpler approach, but it requires a "secure location" to store the credentials and creates a single point of vulnerability if that location is compromised.

Instead, we opted for authentication and resource access through encrypted, short-lived credentials obtained from the AWS Security Token Service (STS). This provides an additional layer of security, since the credentials expire after a short period of time.

Spark jobs are initiated through our pipelines running on AWS. (An overview of our microservices system architecture calls for another blog post.) A Spark job submitted to EMR for execution inherits the limited-access security within our VPC on AWS (seen as the blue and purple steps in the illustration below).

Whenever a Spark job needs to be executed on Dataproc — which is an external resource to AWS — STS credentials are generated and encrypted through our key management microservice, and are transferred over to Dataproc with the Spark job. The driver on Dataproc decrypts the credentials using the same key management microservice, and subsequently uses them for resource access on AWS (as illustrated below in the blue and yellow steps).
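To make the flow concrete, here is an illustrative sketch of how decrypted STS credentials could be wired into a Spark job on Dataproc for S3 access. This assumes the hadoop-aws s3a connector is used to reach S3; the role ARN and session name are hypothetical placeholders, not Onera's actual setup.

```python
def s3a_conf_from_sts(creds):
    """Map temporary STS credentials onto the hadoop-aws (s3a) settings
    that a Spark job running on Dataproc needs to read from and write
    to S3 on AWS."""
    return {
        "spark.hadoop.fs.s3a.access.key": creds["AccessKeyId"],
        "spark.hadoop.fs.s3a.secret.key": creds["SecretAccessKey"],
        "spark.hadoop.fs.s3a.session.token": creds["SessionToken"],
        # Tell s3a to authenticate with session (temporary) credentials
        # rather than long-lived access keys.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    }

# The credentials themselves would come from AWS STS, e.g.:
# import boto3
# resp = boto3.client("sts").assume_role(
#     RoleArn="arn:aws:iam::123456789012:role/spark-s3-access",  # placeholder
#     RoleSessionName="dataproc-spark-job")
# conf = s3a_conf_from_sts(resp["Credentials"])
# ...encrypt conf, ship it with the job, decrypt on the Dataproc driver,
# and apply it to the SparkSession builder via .config(key, value).
```

Because the credentials carry a short expiry, a leaked token is only useful for a limited window, which is the security property motivating this design.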

API Design and Configuration Management

Now that we can submit Spark jobs on either platform securely, we need a cloud-platform-agnostic Spark job management API that can be used within our pipelines to initiate and submit Spark jobs seamlessly on either platform.

We implemented the OneraClusterUtils interface to provide such an abstraction, with methods to submit Spark jobs asynchronously and to query submitted jobs for their completion status. A platform-specific implementation of this interface was then realized for each of the two cloud services to handle the platform-specific details. The abstraction design is not limited to these two cloud providers, but can be extended to other cloud services as well, such as Microsoft Azure. The following graphic visualizes this abstraction API.
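A minimal sketch of such an abstraction might look like the following. The method names and stub bodies are illustrative (the post does not show the actual OneraClusterUtils signatures); the real implementations would call the EMR and Dataproc client libraries where the comments indicate.

```python
from abc import ABC, abstractmethod

class ClusterUtils(ABC):
    """Platform-agnostic Spark job management (names are illustrative)."""

    @abstractmethod
    def submit_job(self, job_name: str, args: list) -> str:
        """Submit a Spark job asynchronously; return a job/step ID."""

    @abstractmethod
    def job_status(self, job_id: str) -> str:
        """Return the completion status of a previously submitted job."""

class EmrClusterUtils(ClusterUtils):
    def submit_job(self, job_name, args):
        # Would call boto3's emr.add_job_flow_steps(...) here.
        return f"emr-step-{job_name}"

    def job_status(self, job_id):
        # Would call emr.describe_step(...) here.
        return "RUNNING"

class DataprocClusterUtils(ClusterUtils):
    def submit_job(self, job_name, args):
        # Would call the google-cloud-dataproc JobControllerClient here.
        return f"dataproc-job-{job_name}"

    def job_status(self, job_id):
        # Would query the Dataproc job state here.
        return "RUNNING"

def cluster_utils_for(provider: str) -> ClusterUtils:
    """Pick the implementation from the configured provider string."""
    return {"emr": EmrClusterUtils, "dataproc": DataprocClusterUtils}[provider]()
```

Because callers depend only on the abstract interface, adding a third provider (such as Azure) amounts to writing one more implementation class and registering its name.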

The OneraClusterUtils implementations also integrate a configuration microservice, which determines the hosted cloud service to use via a unified configuration file. In addition, multiple cluster environments are supported, so that, for example, separate development teams can work on independent projects. The API gives the flexibility of selecting the cloud provider simply by specifying the service to use as a string in the configuration.

The microservice takes care of submitting any Spark job initiated through the pipeline to the Spark cluster on the platform of choice, along with any provider-specific configuration parameters defined for the given environment through configuration management. This configuration management microservice also determines the Spark and YARN configurations from the parameters defined for that environment. (An abstraction over performance-related Spark parameters was also architected, which will be discussed in detail in a future post.)

spark.json
{
  ...
  "spark_cluster_properties": {
    "spark_cluster_provider": {
      "<application_name_1>": "emr",
      "<application_name_2>": "dataproc",
      ...
    },
    "spark_cluster_environment": "production"
  }
}

emr.json
{
  ...
  "production_env": {
    "cluster_id": "<cluster_id>",
    ...
  },
  "foo_env": {
    "cluster_id": "<cluster_id>",
    ...
  },
  "bar_env": {
    "cluster_id": "<cluster_id>",
    ...
  }
}

dataproc.json
{
  ...
  "production_env": {
    "cluster_id": "<cluster_id>",
    ...
  },
  "foo_env": {
    "cluster_id": "<cluster_id>",
    ...
  },
  "bar_env": {
    "cluster_id": "<cluster_id>",
    ...
  }
}
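Tying these files together, the provider lookup can be sketched as a small resolution step: find the provider configured for an application, then pull that provider's settings for the active environment. The function and sample values below are hypothetical, following the key names in the configuration sketches above.

```python
def resolve_cluster(app_name, spark_conf, provider_confs):
    """Look up which provider an application should run on, then pull
    that provider's settings for the configured environment."""
    props = spark_conf["spark_cluster_properties"]
    provider = props["spark_cluster_provider"][app_name]   # "emr" or "dataproc"
    env = props["spark_cluster_environment"]               # e.g. "production"
    env_conf = provider_confs[provider][f"{env}_env"]      # emr.json / dataproc.json entry
    return provider, env_conf["cluster_id"]

# Sample configuration mirroring spark.json, emr.json, and dataproc.json:
spark_conf = {
    "spark_cluster_properties": {
        "spark_cluster_provider": {"app_one": "emr", "app_two": "dataproc"},
        "spark_cluster_environment": "production",
    }
}
provider_confs = {
    "emr": {"production_env": {"cluster_id": "j-ABC123"}},
    "dataproc": {"production_env": {"cluster_id": "my-dataproc-cluster"}},
}

provider, cluster_id = resolve_cluster("app_two", spark_conf, provider_confs)
# → ("dataproc", "my-dataproc-cluster")
```

Switching an application between providers then requires changing only its string in spark.json, with no code changes in the pipeline itself.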

What’s Next?

Accessing Spark logs seamlessly across multiple hosted services is yet another major hurdle that we’ll discuss — with details on unifying the Spark log viewability with our pipelines — in another blog post.

With the launch of our platform-agnostic Spark service, we are now able to execute Spark jobs on the desired platform by changing just a configuration variable. To mitigate Spark cluster downtime on either platform, we have also implemented an automatic Spark cluster failover service — which, of course, also calls for its own future blog post.

Since the incorporation of GCP Dataproc into the Onera pipelines, we have further extended the use of GCP in our architecture in conjunction with AWS, making Onera’s system architecture truly a hybrid cloud platform.
