Running Spark 3 on AKS with Azure AD integration

Gergely Soti · datamindedbe · 5 min read · Nov 2, 2020

Do you want to run Spark 3 on AKS in pro mode? Meaning no more “just copy-paste the storage account access key into the source code, and do spark-submit”? Then this is for you!

Demo mode

One can run Spark on AKS “out of the box”. It does work: it can calculate the value of pi to a certain degree of accuracy. However, the moment one wants to do something useful, such as reading from or writing to a storage account, one typically resorts to giving Spark the storage account access keys. This might be acceptable for small teams working on a few projects.
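For reference, “demo mode” typically boils down to something like the following sketch, where the shared key of the storage account ends up in the job itself (the account name, key and paths are placeholders):

# Demo mode: shared-key authentication baked into the Spark job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Anyone who can read this code (or the Spark config) now has full access
# to the entire storage account.
spark._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    "<STORAGE_ACCOUNT_ACCESS_KEY>"
)

df = spark.read.parquet(
    "abfss://<FILESYSTEM>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data.parquet/"
)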

Pro mode

Once you have more than a few people working on more than a few projects, things start to get messy. In demo mode, every Spark job can read/write the whole storage account. It can connect to every Key Vault. In general, once you grant any Spark job access to an Azure resource, every Spark job has access to it!

The way to improve on demo mode is to attach identities to the Spark jobs running on AKS. These identities can then be given fine-grained access to the different Azure resources (Key Vaults, Storage Accounts, CosmosDB etc.).

So let’s see what it takes to run Spark in pro mode!

Prerequisites

  • a kubernetes (AKS) cluster (something recent, such as 1.17)
  • a storage account for your data (ADLS Gen2)
  • an ACR (Azure Container Registry) to hold your images

There will be terraform snippets here and there, because, as we are in pro mode, we don’t click left and right in the Azure portal anymore!
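For example, the storage account (with hierarchical namespace enabled, which is what makes it ADLS Gen2), its filesystem and the container registry referenced later could be declared roughly like this (a sketch; the names, SKUs and resource group variable are illustrative):

resource "azurerm_storage_account" "datalake" {
  name                     = "sparkaksdatalake"
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = true # hierarchical namespace = ADLS Gen2
}

resource "azurerm_storage_data_lake_gen2_filesystem" "data" {
  name               = "data"
  storage_account_id = azurerm_storage_account.datalake.id
}

resource "azurerm_container_registry" "acr" {
  name                = "sparkaksacr"
  resource_group_name = var.resource_group_name
  location            = var.location
  sku                 = "Basic"
}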

Initialize AKS

Enable the SystemAssigned identity

Azure supports Managed Identities for various resources (Data Factory, App Service etc). Enable this for AKS, as this will form the basis of our authentication mechanism.
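In Terraform this is a single identity block on the cluster resource (a minimal sketch; your cluster definition will of course have more settings):

resource "azurerm_kubernetes_cluster" "k8s" {
  name                = "spark-aks"
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = "sparkaks"

  default_node_pool {
    name       = "default"
    node_count = 2
    vm_size    = "Standard_DS3_v2"
  }

  # System-assigned managed identity for the cluster
  identity {
    type = "SystemAssigned"
  }
}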

Allow AKS to pull images from the container registry

We need to assign the “AcrPull” role to the AKS managed identity (created in the previous section), which will enable AKS to pull any image from the Azure Container Registry (ACR).

resource "azurerm_role_assignment" "acr_pull" {
  principal_id         = data.azurerm_user_assigned_identity.k8s_identity.principal_id
  role_definition_name = "AcrPull"
  scope                = azurerm_container_registry.acr.id
}

The managed identity of AKS does not play well with Terraform, which is why you see a data.azurerm_user_assigned_identity lookup in the code instead of an attribute on the cluster resource.
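For reference, that lookup could be defined like this (a sketch that assumes the default AKS behaviour, where the kubelet identity is created as “<cluster name>-agentpool” in the node resource group):

data "azurerm_user_assigned_identity" "k8s_identity" {
  name                = "${azurerm_kubernetes_cluster.k8s.name}-agentpool"
  resource_group_name = azurerm_kubernetes_cluster.k8s.node_resource_group
}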

Install aad-pod-identity

Azure can assign user-defined identities to each pod by using a kubernetes add-on called aad-pod-identity. This is an open-source project rather than an officially supported Azure feature, but the official AKS documentation refers to it. I will assume that you have followed their tutorial and are familiar with the basics.

Once you install aad-pod-identity, you can simply label your pods with the appropriate identities. The add-on will make sure that the specific node (a virtual machine) running your pod gets assigned the correct identity. For this to work, however, our AKS cluster needs to be able to adjust the identities of the VMs running as kubernetes nodes. So we assign the following two roles to our cluster’s managed identity (see the Terraform sketch after this list):

  • Managed Identity Operator
  • Virtual Machine Contributor
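In Terraform, these two assignments could look roughly like this, scoped to the node resource group where aad-pod-identity manipulates the VM/VMSS identities (a sketch; adjust the scope to your setup):

data "azurerm_resource_group" "node_rg" {
  name = azurerm_kubernetes_cluster.k8s.node_resource_group
}

# Allow the cluster (kubelet) identity to assign managed identities...
resource "azurerm_role_assignment" "mi_operator" {
  principal_id         = data.azurerm_user_assigned_identity.k8s_identity.principal_id
  role_definition_name = "Managed Identity Operator"
  scope                = data.azurerm_resource_group.node_rg.id
}

# ...and to modify the VMs / VM scale sets backing the kubernetes nodes.
resource "azurerm_role_assignment" "vm_contributor" {
  principal_id         = data.azurerm_user_assigned_identity.k8s_identity.principal_id
  role_definition_name = "Virtual Machine Contributor"
  scope                = data.azurerm_resource_group.node_rg.id
}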

Validate this setup

At this stage it is useful to confirm that the whole setup works as intended. To keep the test simple, you will need:

  • a docker image with a python script that reads a few files from a storage account
  • an identity which our pod will assume
  • an ADLS Gen2 storage account (with a filesystem initialized) containing some example files

The identity would look similar to this one

resource "azurerm_user_assigned_identity" "spark_aks" {
  location            = var.location
  name                = "sparkaks"
  resource_group_name = azurerm_kubernetes_cluster.k8s.node_resource_group
}

resource "azurerm_role_assignment" "spark_aks_sa_contrib" {
  principal_id         = azurerm_user_assigned_identity.spark_aks.principal_id
  scope                = azurerm_storage_account.datalake.id
  role_definition_name = "Reader"
}

Make sure you adjust the access on your example files (for instance via the POSIX-like ACLs of ADLS Gen2) so that this identity has read/write access to the paths it should reach, and none to the paths it shouldn’t.

The example python script uses the Azure SDK to read some data from the storage account:

from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import DefaultAzureCredential

# DefaultAzureCredential picks up the managed identity provided by aad-pod-identity
credential = DefaultAzureCredential()

client = DataLakeServiceClient(
    "https://<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    credential
)

fs = client.list_file_systems()
print("Found the following file systems")
for f in fs:
    print(f.name)

for f in ["working/test.csv", "forbidden/test.csv"]:
    print(f"Opening file {f}")
    try:
        file_client = client.get_file_client("data", f)
        download = file_client.download_file()
        downloaded_bytes = download.readall()
        print(downloaded_bytes)
    except Exception as e:
        print(f"Failed: {e}")

You still need to deploy two k8s resources: an AzureIdentity (defining the “sparkaks” identity and relating it to the Azure identity we created) and an AzureIdentityBinding (which binds that identity to a pod label).
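A minimal sketch of the two resources, based on the aad-pod-identity CRDs (the resource ID and client ID are those of the azurerm_user_assigned_identity created above):

apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
  name: sparkaks
spec:
  type: 0  # 0 = user-assigned managed identity
  resourceID: <RESOURCE_ID_OF_THE_SPARKAKS_IDENTITY>
  clientID: <CLIENT_ID_OF_THE_SPARKAKS_IDENTITY>
---
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: sparkaks-binding
spec:
  azureIdentity: sparkaks
  selector: sparkaks  # pods labelled aadpodidbinding: sparkaks get this identity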

Once you have them, you can run the test pod:

apiVersion: v1
kind: Pod
metadata:
  name: demosdk
  labels:
    aadpodidbinding: sparkaks
spec:
  containers:
  - name: demo
    image: <YOUR_DOCKER_IMAGE>
    imagePullPolicy: Always
    command:
    - python3
    args:
    - /sdk-test.py
  nodeSelector:
    kubernetes.io/os: linux

The python script should be able to read working/test.csv, and should fail to access forbidden/test.csv.

Create a Spark image

Build Spark 3 from sources

The pre-built Spark distributions usually don’t ship with the combination you need here (kubernetes support, Java 11, Hadoop 3.3), so you will need to build everything yourself. Luckily, running Spark on kubernetes is a well-documented process. I strongly recommend that you follow the whole tutorial and make sure that you can actually run the SparkPi example on your AKS cluster.

There are a few caveats:

  • for Spark 3 you will want Java 11
  • for accessing ADLS Gen2 via a managed identity, you will need Hadoop 3.3

Java 11 can be enabled by passing an argument to docker-image-tool.sh:

./bin/docker-image-tool.sh -r <IMAGE> -t <TAG> -X -b java_image_tag=11-jre-slim build
./bin/docker-image-tool.sh -r <IMAGE_PYSPARK> -t <TAG> -X -b java_image_tag=11-jre-slim -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

Once you have this image, use it as a base to install Hadoop 3.3.

Use Hadoop 3.3

The ADLS Gen2 Storage Account can be used by Spark via a Hadoop layer:

spark.read.parquet("abfss://<FILESYSTEM>@<STORAGE_ACCOUNT>.dfs.core.windows.net/data.parquet/")

Here, abfss refers to the ABFS driver from the hadoop-azure package. Of the different authentication mechanisms, we need “Azure Managed Identity”. Based on the hadoop-azure documentation, we arrive at the following Hadoop configuration:

'fs.azure.account.auth.type': 'OAuth',
'fs.azure.account.oauth.provider.type': 'org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider',
'fs.azure.account.oauth2.msi.tenant': '<TENANT_ID>',
'fs.azure.account.oauth2.client.id': '<CLIENT_ID>'

Run a pyspark script

Here is a minimal pyspark testing script:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

# set the hadoop configuration from the previous section
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set('fs.azure.account.auth.type', 'OAuth')
hadoop_conf.set('fs.azure.account.oauth.provider.type',
                'org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider')
hadoop_conf.set('fs.azure.account.oauth2.msi.tenant', '<TENANT_ID>')
hadoop_conf.set('fs.azure.account.oauth2.client.id', '<CLIENT_ID>')

path = "abfss://<FS>@<SA>.dfs.core.windows.net/data.parquet/"
df = spark.read.parquet(path)
print(df.count())
spark.stop()

If you don’t call spark.stop(), you pod will never terminate.

Now, to actually submit this Spark job to AKS, you can use the following command:

./bin/spark-submit \
--master k8s://<YOUR_AKS_MASTER>:443 \
--deploy-mode cluster \
--name sparkaks \
--proxy-user 185 \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=<SPARK_IMAGE> \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.driver.label.aadpodidbinding=sparkaks \
--conf spark.kubernetes.executor.label.aadpodidbinding=sparkaks \
local:///test.py

This requires Spark to be installed on the machine that executes the command. We usually use Airflow for task scheduling, so we run the above command on the Airflow server directly.
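As an illustration, scheduling this from Airflow can be as simple as wrapping the command in a BashOperator (a sketch: the DAG id, schedule and spark-submit path are made up, and the import path assumes Airflow 2; on 1.10 it is airflow.operators.bash_operator):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_on_aks_example",
    start_date=datetime(2020, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Runs the same spark-submit command as above on the Airflow worker,
    # which therefore needs a local Spark installation.
    submit = BashOperator(
        task_id="spark_submit_test",
        bash_command=(
            "/opt/spark/bin/spark-submit "
            "--master k8s://<YOUR_AKS_MASTER>:443 "
            "--deploy-mode cluster "
            "--conf spark.kubernetes.container.image=<SPARK_IMAGE> "
            "--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark "
            "--conf spark.kubernetes.driver.label.aadpodidbinding=sparkaks "
            "--conf spark.kubernetes.executor.label.aadpodidbinding=sparkaks "
            "local:///test.py"
        ),
    )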

I work at Data Minded, an independent data engineering and data analytics consultancy based in Leuven, Belgium. We built and ran Data Platforms with Kubernetes clusters and processed massive amounts of data. If you need help with your Data Platform, contact us!
