Cheatsheet: How to run Spark 2.4 on macOS with MiniKube

Prerequisites

  • VirtualBox
  • Docker
  • Kubectl
  • MiniKube
  • Spark

Installation

Kubectl, Docker, MiniKube, VirtualBox

brew update && brew install kubectl && brew cask install docker minikube virtualbox

Verify

docker --version                
docker-compose --version
docker-machine --version
minikube version
kubectl version --client

Spark

https://spark.apache.org/downloads.html

Download the distribution and unpack it into a folder of your choice, say /opt/spark-2.4.
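
For example, for Spark 2.4.0 pre-built for Hadoop 2.7 (adjust the version, package, and mirror to whatever you picked on the downloads page), something like:

curl -LO https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
sudo mv spark-2.4.0-bin-hadoop2.7 /opt/spark-2.4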

Additional configuration

Add memory

minikube config set memory 8192

Add CPUs

minikube config set cpus 4
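
These settings take effect the next time the MiniKube VM is created. You can confirm them with:

minikube config view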

Configure RBAC

Allow the service account default:default to access the default namespace (the Spark driver needs this to create and delete executor pods):

kubectl create clusterrolebinding default --clusterrole=edit --serviceaccount=default:default --namespace=default
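
This is a broad grant, but it is enough for the Spark driver to manage executor pods and fine for a local playground. You can verify the binding with:

kubectl get clusterrolebinding default -o wide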

Start MiniKube

minikube start

The output should look something like this:

Starting local Kubernetes v1.13.2 cluster…
Starting VM…
Downloading Minikube ISO
181.48 MB / 181.48 MB [================================] 100.00% 0s
Getting VM IP address…
Moving files into cluster…
Downloading kubelet v1.13.2
Downloading kubeadm v1.13.2
Finished Downloading kubeadm v1.13.2
Finished Downloading kubelet v1.13.2
Setting up certs…
Connecting to cluster…
Setting up kubeconfig…
Stopping extra container runtimes…
Starting cluster components…
Verifying kubelet health …
Verifying apiserver health …
Kubectl is now configured to use the cluster.
Loading cached images from config file.
Everything looks great. Please enjoy minikube!
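
To double-check that the cluster is really up and that kubectl points at it:

minikube status
kubectl cluster-info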

Building and Deploying Spark Image

Find the docker-image-tool.sh script in the Spark folder and run:

/opt/spark-2.4/bin/docker-image-tool.sh -m -t spark build

This should end up deploying a Linux-based image, with Java 8 and Spark 2.4 installed, into MiniKube's local Docker repository. To check it, we need to connect to that particular Docker daemon:

eval $(minikube docker-env)
docker image ls

Among others, you should see an image named spark with the tag spark, i.e. spark:spark.

Now we are ready to run Spark jobs whose artifacts are accessible from the pod's local environment. Basically, this means our applications get deployed inside the image alongside Spark, which is cool, but looks limited.

Get prepared for running jobs

Let's determine MiniKube's IP address.

minikube ip

Copy the output. Let’s assume it is 192.168.99.101
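
Instead of copying it by hand, you can also capture the Kubernetes master URL in a shell variable (K8S_MASTER is just an arbitrary name used here; 8443 is MiniKube's default apiserver port):

K8S_MASTER="k8s://https://$(minikube ip):8443"
echo $K8S_MASTER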

Example

We can run the example Pi job that ships with the Spark installation.

So, go to Spark's bin folder and run:

./spark-submit \
--master k8s://https://192.168.99.101:8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=spark:spark \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
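
The local:// scheme tells Spark that the jar is already present inside the container image, so nothing has to be uploaded. While spark-submit is running in cluster mode, you can watch the driver and executor pods appear from another terminal:

kubectl get pods -w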

Making sure it actually ran

If everything went fine, we should be able to check the logs and see the result.

Figure out the driver pod's name:

kubectl get pods

You should see something like

NAME                            READY   STATUS      RESTARTS   AGE
spark-pi-1548218924109-driver   0/1     Completed   0          110m

Then we can cat or grep the log from that pod:

kubectl logs spark-pi-1548218924109-driver | grep "Pi is roughly"

This should give us a result.

Pi is roughly 3.144075720378602

TODO: Job delivery

There are several approaches to delivering your application jar to the cluster; one of them is sketched after the link below.

For details look at https://spark.apache.org/docs/latest/submitting-applications.html
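
One rough sketch: bake the application jar into an image derived from the spark:spark image built above. The jar name, paths, and image tag below are hypothetical; adjust them to your project.

# build against MiniKube's Docker daemon, same as before
eval $(minikube docker-env)

# hypothetical application jar and image name
cat > Dockerfile <<'EOF'
FROM spark:spark
COPY target/my-spark-app.jar /opt/spark/work-dir/my-spark-app.jar
EOF

docker build -t spark-app:latest .

spark-submit would then use --conf spark.kubernetes.container.image=spark-app:latest and point at local:///opt/spark/work-dir/my-spark-app.jar.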