The Pros and Cons of Running Apache Spark on Kubernetes

Kubernetes support was only recently added to Spark. How does it compare to the other deployment modes, and is it worth it?

Hisham Itani
Data Mechanics

--

Apache Spark is an open-source distributed computing framework. In a few lines of code (in Scala, Python, SQL, or R), data scientists and engineers define applications that can process large amounts of data, while Spark takes care of parallelizing the work across a cluster of machines.

Spark itself doesn’t manage these machines. It needs a cluster manager (sometimes also called a scheduler). The main cluster managers are:

  • Standalone: A simple cluster manager with limited features, built into Spark.
  • Apache Mesos: An open-source cluster manager that was once popular for big data workloads (not just Spark) but has been in decline over the last few years.
  • Hadoop YARN: Hadoop's JVM-based cluster manager, released in 2012 and the most commonly used to date, both on-premises (e.g. Cloudera, MapR) and in the cloud (e.g. EMR, Dataproc, HDInsight).
  • Kubernetes: Spark has run natively on Kubernetes since version 2.3 (2018). This deployment model is quickly gaining traction and enterprise backing (Google, Palantir, Red Hat, Bloomberg, Lyft)…
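In practice, the cluster manager is chosen through the --master option of spark-submit, and each of the four managers above has its own URL format. The following is a minimal Python sketch of that mapping, not code from Spark itself. The helper names (master_url, spark_submit_args), the host names, the container image, and the default ports are all assumptions made for illustration; check the ports and addresses used by your own cluster.

```python
from typing import List, Optional

# Commonly used default ports. These are assumptions for illustration;
# your cluster may well listen elsewhere (e.g. 6443 for a Kubernetes API server).
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050, "kubernetes": 443}


def master_url(manager: str, host: Optional[str] = None,
               port: Optional[int] = None) -> str:
    """Build the value passed to spark-submit's --master option (hypothetical helper)."""
    manager = manager.lower()
    if manager == "yarn":
        # With YARN, the resource manager address is read from the Hadoop
        # configuration, so the master value is just the literal "yarn".
        return "yarn"
    if manager not in DEFAULT_PORTS:
        raise ValueError(f"unknown cluster manager: {manager!r}")
    if not host:
        raise ValueError(f"{manager} requires a host")
    port = port or DEFAULT_PORTS[manager]
    if manager == "standalone":
        return f"spark://{host}:{port}"
    if manager == "mesos":
        return f"mesos://{host}:{port}"
    # Kubernetes: the k8s:// prefix is followed by the API server URL.
    return f"k8s://https://{host}:{port}"


def spark_submit_args(manager: str, app: str, host: Optional[str] = None,
                      port: Optional[int] = None,
                      image: Optional[str] = None) -> List[str]:
    """Assemble a spark-submit command line as a list of arguments (hypothetical helper)."""
    args = ["spark-submit", "--master", master_url(manager, host, port)]
    if manager.lower() == "kubernetes":
        if not image:
            raise ValueError("Kubernetes needs a container image for the driver and executors")
        # On Kubernetes the driver itself runs in a pod, hence cluster deploy mode.
        args += ["--deploy-mode", "cluster",
                 "--conf", f"spark.kubernetes.container.image={image}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Hypothetical host, image, and application path.
    print(" ".join(spark_submit_args(
        "kubernetes",
        "local:///opt/spark/examples/src/main/python/pi.py",
        host="k8s.example.com",
        image="example.org/spark-py:latest",
    )))
```

The application code stays the same across all four managers; only the master URL and a few manager-specific settings change. The resulting list can be passed to subprocess.run on a machine where spark-submit is installed.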
