Why every Spark Developer should care about Kubernetes

Pierce Lamb
11 min read · Aug 13, 2018


There has been a litany of articles written recently introducing us to Kubernetes (k8s), reminding us of its astronomical rise in popularity and showing how one can use it to build distributed, scalable microservices. What we find is that the reader is soon overwhelmed with concepts and terminology and doesn’t grasp the full potential Kubernetes has for her specific application or service.

Instead, we focus on the capabilities of k8s from a big data engineer’s or scientist’s perspective: more specifically, on building composite Spark applications, touching on k8s terminology and concepts only for clarity and as a reference for further reading.

In subsequent posts we will walk through the specific commands and details of how one can realize the features described here.

PREDICTABILITY THROUGH CONSISTENT CONFIG, RESOURCES AND LIBRARIES

Spark applications are inherently heterogeneous. A Spark application depends on myriad third-party products and libraries: for instance, Python analysis software like NumPy, SciPy, scikit-learn and pandas, and libraries for connectivity to data sources like Kafka, cloud storage, Hadoop, and SQL and NoSQL databases.

Developers spend enormous amounts of time getting the data processing pipeline to work, only to find that they have to deal with constant changes to the software components (new versions, patches, etc.), resulting in a nightmare of trying to maintain compatibility among the pieces while delivering high availability.

Moreover, Spark depends on having the right version of Java running everywhere and on some assurances about resources: memory, temporary space for overflow/shuffle, and reliable access to storage. We even find variations in OS versions, patches and plain configuration across a cluster, resulting in unexpected faults at runtime. For instance, inconsistent availability of memory causing the Java heap to swap to disk, or newly added executors running out of /tmp space.

If you have worked on distributed systems you know that such problem scenarios are almost infinite. What we need is a more disciplined, consistent approach to how any software is configured and deployed. We find that the simplest way to achieve more consistency and predictability when dealing with several services or components is containers (you likely guessed this already).

And, containers make up the core of Kubernetes.

One container by itself, however, is just a useful way to package an application; many containers, working together and running at scale, can transform an enterprise. Kubernetes, at its core, manages containers: it provisions them, scales them and controls access to them across distributed teams.

“Containers offer a logical packaging mechanism in which applications can be abstracted from the environment in which they actually run. This decoupling allows container-based applications to be deployed easily and consistently, regardless of whether the target environment is a private data center, the public cloud, or even a developer’s personal laptop. Containerization provides a clean separation of concerns, as developers focus on their application logic and dependencies, while IT operations teams can focus on deployment and management without bothering with application details such as specific software versions and configurations specific to the app” ….

In essence, they provide a darn efficient way to sandbox each application. — [container basics]

So, for instance, Spark developers can package into a single container image all of the dependent data store connectors and all of the ML/numerical analysis libraries that are compatible with the version of Spark or Hadoop the application team chooses to use. Developers no longer have to struggle through compatibility issues: they can figure out the dependencies once and make the image available to the entire team.
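
As a rough sketch of what such an image might look like (the base image, library versions and JAR names below are illustrative, not a prescribed stack):

```
# Illustrative Dockerfile: pin Spark, Python libraries and connector JARs once.
FROM openjdk:8-jre-slim

# Python analysis stack, pinned to versions the team has validated.
RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip3 install numpy==1.14.5 scipy==1.1.0 scikit-learn==0.19.2 pandas==0.23.4

# Spark itself; ADD auto-extracts the tarball into /opt.
ADD spark-2.3.1-bin-hadoop2.7.tgz /opt/
ENV SPARK_HOME=/opt/spark-2.3.1-bin-hadoop2.7

# Bake in data source connectors (e.g., Kafka) so every environment matches.
COPY jars/spark-sql-kafka-0-10_2.11-2.3.1.jar $SPARK_HOME/jars/
```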

With Kubernetes, you can choose the container repository from which images are downloaded at runtime, use ConfigMaps to inject configuration properties for individual services or coarse-grained configuration like service endpoints, or use Helm charts to create, version, share and publish composite applications (i.e., several services/components stitched together). Spark applications can also dynamically stage dependencies like their application JARs from HDFS or HTTP servers when baking dependencies into container images gets to be cumbersome.
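
For example, a minimal ConfigMap sketch (the name and endpoints are hypothetical) that injects connection endpoints a Spark app can pick up:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-app-config
data:
  kafka.brokers: "kafka-broker.default.svc:9092"      # hypothetical endpoint
  checkpoint.dir: "hdfs://namenode:8020/checkpoints"  # hypothetical endpoint
```

Pods can then consume these entries as environment variables or as files mounted from a volume, so the same image runs unchanged across environments.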

RESILIENT DRIVERS, EXECUTORS, STATE

The Spark cluster manager will reschedule any failed executors automatically as long as the workers (hosts that run Spark executors) are still up and running. But if a worker node dies, the standalone cluster manager cannot automatically provision new workers. Moreover, resiliency for your drivers requires ZooKeeper as a distributed coordinator, which is yet another component to configure, stand up and debug. Kubernetes schedules all Spark components (drivers, executors) as Pods (a managed entity that wraps a container) and continuously monitors the health of these components.

If Spark jobs are launched as a Kubernetes Job, then failures will automatically trigger re-instantiation of the driver pod and re-establish connectivity to all dependent services. This re-instantiation resets configuration and re-launches the job. Another way to accomplish this is via the Kubernetes Operator pattern. In Kubernetes environments like GKE, the underlying platform can also dynamically provision additional VMs (optimizing resource utilization) and add them as resources to the Kubernetes cluster before launching the Spark driver pod.
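
A hedged sketch of that pattern, assuming a custom image with spark-submit on the PATH (the image, class and JAR names are illustrative; running spark-submit in client mode inside the cluster requires Spark 2.4+):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spark-etl-driver
spec:
  backoffLimit: 3               # re-instantiate the failed driver up to 3 times
  template:
    spec:
      restartPolicy: Never      # let the Job controller handle retries
      containers:
      - name: driver
        image: myregistry/spark-custom:2.4.0    # hypothetical image
        command: ["/opt/spark/bin/spark-submit",
                  "--master", "k8s://https://kubernetes.default.svc",
                  "--deploy-mode", "client",
                  "--class", "com.example.ETLJob",
                  "local:///opt/spark/app/etl.jar"]
```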

Kubernetes follows the principles of microservices and separates compute from storage. Failure or loss of physical nodes has no bearing on access to data. Kubernetes supports multiple storage systems and abstracts your Spark application from the underlying idiosyncrasies by mounting volumes into Spark executors and drivers. And these storage systems often offer HA features: AWS EBS, Ceph, VMware vSAN, ScaleIO and others all offer built-in replication to protect your data from failures.
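
For instance, a PersistentVolumeClaim sketch (the storage class name is illustrative) that a driver or executor pod could mount without knowing what backs it:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-scratch
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2       # e.g., an AWS EBS-backed class; could be Ceph, vSAN, etc.
  resources:
    requests:
      storage: 100Gi          # shuffle/overflow space that survives pod restarts
```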

EASIER ECOSYSTEM PROVISIONING, INTEGRATION

Apache Spark supports connectors to a large number of data sources: NoSQL and relational databases, disparate file formats like Parquet, CSV and ORC, and object stores and file systems such as HDFS, S3 and Azure’s blob store. During development, developers want to quickly stand up their own isolated environment (data stores, storage, development tools like Jupyter or Zeppelin, libraries, and more) without impacting other users.

It turns out that, given the popularity of Kubernetes, most of the popular products are already supported. Using a single kubectl command you can deploy, scale or monitor your own isolated database, notebook and Spark cluster. With Helm, you can provision your entire application, i.e., your application JARs along with the products you depend on, as a release. Later you can upgrade your release, or stop the entire release, with a single command. Take a peek at the Kubeapps Hub: you can pick from over 200 popular products and deploy them to your Kubernetes cluster with just a few clicks/commands.

Helm comes pre-configured with a sizeable list of “stable” charts and an even larger list of “incubating” ones. A Helm chart abstracts away all the complexity of deploying a data source or big data tool on Kubernetes, provisioning resources such as containers, volumes (storage), services (to expose the port and IP address used to access a service), config maps, etc., by templatizing the configuration and allowing the user to override sensible defaults. Helm charts also provide easier upgrades and rollbacks between versions.
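
That release lifecycle boils down to a handful of commands; a sketch using Helm v2-era syntax, with illustrative chart and release names:

```
helm install --name my-kafka incubator/kafka             # deploy a chart as a release
helm upgrade my-kafka incubator/kafka --set replicas=5   # upgrade with an override
helm rollback my-kafka 1                                 # roll back to revision 1
helm delete my-kafka                                     # stop the entire release
```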

MANAGE RESOURCES MORE COLLABORATIVELY, EFFICIENTLY

In large enterprises central IT might provision a single large compute farm to be shared or carved up across teams. But each team or user community may want to manage their own resources and policies. Further, administrators may need to control resource consumption for each of the teams.

Imagine a scenario where the IT folks have provisioned 1,000 cores for big data workloads but now have several teams competing for these resources. One possible solution would be to simply carve up the resources based on some initial assessment, but it might be more efficient to create “fungible” policies that optimize resource utilization. So, for instance, a team allocated 500 cores could balloon to use more resources when overall demand is low.

Kubernetes provides an elegant way to manage such requirements via its Namespace and ResourceQuota features. Kubernetes supports multiple virtual clusters backed by the same physical cluster; these virtual clusters are called namespaces. Namespaces provide a scope for names: names of resources need to be unique within a namespace, but not across namespaces. Essentially, namespaces are a way to divide cluster resources across multiple users or teams (via resource quotas).

A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per namespace. It can limit the quantity of objects that can be created in a namespace by type, as well as the total amount of compute resources that may be consumed by resources in that namespace.
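
Returning to the 1,000-core scenario above, a sketch of a tenant namespace with a quota (all names and numbers are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: analytics-team
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: analytics-quota
  namespace: analytics-team
spec:
  hard:
    requests.cpu: "500"       # the team's guaranteed share
    limits.cpu: "750"         # allow ballooning beyond the request
    requests.memory: 2000Gi   # total memory the team can request
    pods: "200"               # cap the number of pods created
```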

MULTI-TENANCY

As noted earlier, through the use of containers, namespaces and resource quotas one can create and manage many virtual clusters, one per tenant. It is worth noting that, unlike virtual machines (e.g., VMware hypervisors), containers are much more lightweight, making them more appealing for deploying isolated applications. That is, each tenant runs their own containers. And Kubernetes even provides a pluggable networking layer (CNI), which makes it possible to secure your tenant applications all the way down to the network layer. For instance, the Kubernetes platform from Pivotal (PKS) includes NSX-T to programmatically manage software-defined virtual networks and to keep your K8s network secure.
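
As a sketch of that network-level isolation (this requires a CNI plugin that enforces NetworkPolicy; names are illustrative), a policy that only admits traffic from pods in the tenant’s own namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: analytics-team
spec:
  podSelector: {}         # applies to every pod in the namespace
  ingress:
  - from:
    - podSelector: {}     # only pods in this same namespace may connect
```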

MULTI-CLOUD

Cloud neutrality and portability across clouds are very attractive to enterprises. Our customers not only want independence from any particular cloud but desire portability across public clouds and their in-house cloud based on VMware infrastructure. Today, when deploying your app to any public cloud, you are required to go through multiple cloud-specific steps to get the infrastructure ready for your app. For example, provisioning compute instances, configuring the networking layer and firewall rules, provisioning storage, exposing the service endpoints and provisioning load balancers are all vendor-specific steps that need to be repeated for each cloud.

While there is still work to be done for a simple lift-and-shift across clouds, Kubernetes takes over much of this operational work, making the process significantly simpler. Here is a bold claim from Google:

“Kubernetes is built to be used anywhere, allowing you to orchestrate across on-site deployments to public clouds to hybrid deployments in between. This enables your infrastructure to reach your users where they’re at, your applications to have higher availability, and your company to balance your security and cost concerns, all tailored to your specific needs.”

Kubernetes provides a framework for managing infrastructure and applications in a consistent way, whether on a particular cloud or on premise. For a Spark developer and administrator, common devops tasks come down to executing simple kubectl commands. Spark developers can provision resources such as persistent volumes (for storage) and service endpoints (to expose your service), uniformly manage configuration across distributed services using ConfigMaps, pass security data using Secrets, and so on, all in a cloud-independent way.
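
The same few commands work unchanged on GKE, EKS, AKS or an on-premise cluster; a sketch with illustrative names and placeholder credentials:

```
kubectl create configmap spark-conf --from-file=spark-defaults.conf
kubectl create secret generic s3-creds \
    --from-literal=access-key=AKIA... --from-literal=secret-key=...
kubectl apply -f spark-scratch-pvc.yaml     # a volume claim like the earlier sketch
kubectl expose deployment spark-ui --port=4040 --type=LoadBalancer
```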

Spark/big data apps need access to large storage volumes and are very sensitive to access speed. And, most often, your app is tightly coupled to the cloud vendor-provided storage like EBS, S3 or GCS. Kubernetes provides persistent volumes, an abstraction that decouples your app from the underlying storage system. In fact, you can easily mount multiple storage volumes and switch between them. The ecosystem for storage is also rapidly emerging. For instance, projects such as Rook provide a cloud-native storage orchestrator, making it easy to provision a storage solution of the user’s choice in a cloud environment.

Kubernetes is now available on all major clouds: GCP (GKE), AWS (EKS), Azure (AKS), Pivotal PKS (which runs on VMware infrastructure and GCP) and many more. To summarize, the promise of Kubernetes is true portability across any cloud, providing a true multi-cloud experience and avoiding vendor lock-in.

DYNAMIC SCALING: UP, DOWN AND SIDEWAYS

You may be aware that Spark lets you scale executors up or down per the workload via its Dynamic Resource Allocation feature. But it is limited by the resources available to the machine or VM. You can go beyond this by using an external resource manager like YARN, Mesos or, better, Kubernetes.

Kubernetes allows users to specify resource quantities in terms of ‘requests’ and ‘limits’ for resources like CPU and memory. Your container is initially allocated the quantity requested; Kubernetes then monitors the load on the resources in containers and allows a container to consume resources up to the specified limit. Users can also configure the frequency at which the resources are monitored.
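
A sketch of the resource stanza inside a pod spec (the numbers and image are illustrative):

```yaml
containers:
- name: spark-executor
  image: myregistry/spark-custom:2.4.0   # hypothetical image
  resources:
    requests:
      cpu: "4"          # guaranteed at scheduling time
      memory: 8Gi
    limits:
      cpu: "8"          # may burst up to this ceiling
      memory: 16Gi      # exceeding this gets the container OOM-killed
```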

What makes Kubernetes appealing is that you can change the requested CPU or memory on the fly and Kubernetes will automatically migrate your Pod (e.g., your Spark executor). For instance, say you requested 8 cores and all available nodes in the cluster have 8 cores, and later you want your deep learning algorithm to run on a container with 64 cores. On platforms like GKE, you can use node pools to automatically expand the nodes available to Kubernetes with an appropriate instance type (64+ cores) and then migrate your executor pods to these instances with no manual intervention.

Another way to expand your application’s load capacity is to increase the number of containers (or pods) that are running/hosting your app. Kubernetes allows users to accomplish this via its Horizontal Pod Autoscaler feature. Depending upon resource consumption on the existing pods, Kubernetes dynamically spawns new pods to keep the resource metrics within the limits specified.
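
A minimal HorizontalPodAutoscaler sketch (the target name and thresholds are illustrative) that keeps average CPU at or below 70% by scaling between 2 and 10 pods:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app        # hypothetical deployment hosting the app
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```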

As endpoints are exposed using Kubernetes Services, which provide load balancing and serve as a proxy, Pod termination or migration has no impact on the client app.
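
A Service sketch (names and ports are illustrative): clients talk to the stable Service address while pods behind the label selector come and go:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
spec:
  selector:
    app: my-app         # routes to any healthy pod carrying this label
  ports:
  - port: 80            # stable port clients connect to
    targetPort: 8080    # port the app listens on inside the pod
```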

Below is an illustration showcasing these concepts on two different nodes.

Node A has pods whose containers are configured with resource requests and limits such that the containers can resize themselves as needed, while Node B has pods configured with the Horizontal Pod Autoscaler (Pod 1 replicating into 3 more pods when needed).

BETTER CENTRALIZED MANAGEMENT AND MONITORING

In order to make sure that analytics jobs scale and run in a consistent, predictable way, a Spark developer or administrator may need to examine various metrics. Typically an analytics system consists of multiple big data products such as Kafka, notebook environments, Apache Spark and data sources working in tandem. A user may want to see metrics for each tool as well as overall cluster usage metrics in order to identify bottlenecks and take action.

In Kubernetes, a developer can monitor various resources such as containers, pods, services, and the entire cluster. Kubernetes exposes various metrics useful to the developer, and there are tools available that make use of these metrics and present the information in a digestible layout. For example, cAdvisor (Container Advisor) helps monitor container-level usage and performance characteristics, while Prometheus is an open-source tool that monitors Kubernetes and its nodes and provides querying features for visualizing data; it enables centralized infrastructure monitoring by collecting various metrics out of the box. Additionally, a tool like Grafana can be used along with Prometheus to create dashboards for visualizing metrics data.

A Spark developer can thus monitor cluster resource metrics using Prometheus. Spark provides a configurable metrics system that can report metrics to various sinks, so a Spark developer can use a Prometheus sink, such as the one given here, to view Spark driver/executor-level metrics.
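
A hedged sketch of such a metrics.properties file, assuming a third-party PrometheusSink (like the one linked above) is on the driver and executor classpath; the sink class name and pushgateway address are illustrative:

```
# Report metrics from all instances to a Prometheus pushgateway.
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address=prometheus-pushgateway:9091

# Also expose JVM metrics from the driver and executors.
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```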

To sum up, Kubernetes provides a whole host of ways to make provisioning and configuring Spark clusters easier and more transparent. We at SnappyData are currently working on our Snappy-on-k8s offering; you can follow the GitHub repo here.
