How to Integrate PySpark, Snowflake, Azure, and Jupyter: Part 2

Doug Eisenstein
10 min read · Jun 5, 2020


Launch 🚀 a PySpark Cluster

Background

This is part two of a three-part series. In part one, we learned about PySpark, Snowflake, Azure, and Jupyter Notebook. Now, in part two, we'll learn how to launch a PySpark cluster and connect to an existing Snowflake instance.

Step 1: Prepare for cluster build-out

Precursor

We will launch a production-grade PySpark cluster using an HDInsight image 💪, and then install external Python packages from PyPI that don't come pre-loaded with the base image into the Python virtual environment across all of your cluster nodes. This means that if you want to install a special Python package, go right ahead: follow my steps and customize them for your needs.

External Python Packages

Make a list of the Python packages you need on the cluster. This can be in the form of a requirements.txt file or a simple list of names; I only have a couple of PyPI packages: seaborn, pyarrow, and plotly.
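
For example, if you want to keep this as a file, a minimal requirements.txt covering those three packages can be created like so (the file name and where you keep it are entirely up to you):

# Write a requirements.txt listing the extra PyPI packages for the cluster
cat > requirements.txt <<'EOF'
seaborn
pyarrow
plotly
EOF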

You can also install your own internal proprietary packages, even if they aren't on PyPI. You'll need to use python setup.py sdist to build your package into a tar.gz file, and then put it somewhere that's accessible by the cluster, like an Azure BLOB store (you can even use S3 or GCS if you'd like).
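
As a rough sketch of that flow, assuming a hypothetical package called mypackage and the storage account variables we'll define in Step 2, it might look something like this:

# Build a source distribution; this produces something like dist/mypackage-0.1.0.tar.gz
cd mypackage
python setup.py sdist

# Stage the tarball in an Azure BLOB container the cluster can reach
az storage blob upload \
  --account-name $AZURE_STORAGE_ACCOUNT \
  --container-name $AZURE_STORAGE_CONTAINER \
  --account-key $AZURE_STORAGE_KEY \
  --name mypackage-0.1.0.tar.gz \
  --file dist/mypackage-0.1.0.tar.gz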

Pro Tip: Virtual Environment

This tutorial will not cover installing a Python virtual environment on your cluster, but I do recommend doing so before going into production. Creating virtual environments is a best practice, because it's safer to manage your own virtual environment on the cluster than to mix and match with the one HDInsight deploys.

I recommend you familiarize yourself with how to create a new Python virtual environment and configure the cluster to use it. I'll take you through these steps, but if there is something I don't cover, chances are it'll be in one of the references I'm linking to.

Step 2: Launch PySpark Cluster

You have two ways to build your PySpark cluster on Azure: the Wizard or the CLI. I'll show you both, but remember that if you're doing this for real in production, I'd recommend using Terraform (in which case it's better to go the CLI route).

Two ways to build PySpark (A) Wizard or (B) CLI

Before you start, I'd recommend a quick read through this article from Microsoft to familiarize yourself with the process of creating a PySpark cluster; once you've done that, proceed.

Launch using 🧙‍♂️ Wizard

First, let’s fire up a browser to https://portal.azure.com, navigate to HDInsight, and click on Create HDInsight Cluster.

Then, in basics, build up your cluster so it looks like this, choosing Spark 2.3 and HDInsight 3.6, and be sure to choose the region where your Snowflake instance is hosted; for me that's East US, since I'm in Boston. You will need the username/password saved somewhere, because you'll use it later to ssh in, explore with Jupyter, and manage the cluster with Ambari.

Next, in storage, choose a primary storage account that you’ll be using to store data used between PySpark and Snowflake. I’m not using any metadata store for this tutorial but feel free to configure it if you’d like.

Choose all of the defaults for security + networking if you have a vanilla configuration.

Now, in configuration + pricing, you can choose a small configuration, like I've done with mine below; just remember to shut down the cluster after you're finished, so you are not paying for it:

The node sizes above are quite small, but if you're in test mode this should be good, so pick what you think is appropriate and then finish creating the cluster. You'll need to wait 15–25 minutes, and that assumes you haven't built a huge cluster, so keep that in mind.

CLI

Now we'll move into how to launch a cluster from the command line. You should take a look at this article as a point of reference for how to create the cluster using the CLI. I'm also assuming you've already set up your Azure CLI; if you haven't, go here and get that done first.
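
If you need to install the CLI first, these are the usual one-liners; double-check Microsoft's install page for your exact platform:

# macOS (Homebrew)
brew install azure-cli

# Debian/Ubuntu (Microsoft's install script)
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash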

Next, pop open a terminal window:

Blank terminal window, clean!

Then log in to Azure by running az login. It will open a browser tab; sign in, head back to the terminal, and you should see something like this:

AZ Login
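
If you work with more than one subscription, it's worth confirming which one the CLI is pointed at before creating anything; the subscription name below is just a placeholder:

# Check the active subscription, and switch if needed
az account show --output table
az account set --subscription "My Subscription Name"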

Then copy/paste these environment variables into your favorite code editor, like Atom, VS Code, Sublime, Notepad++, vi, emacs, etc. More importantly, edit them and customize them for your needs, and be sure to save them somewhere so you remember all of these details; we recommend to our customers that they use LastPass for storing private details like username/password.

export RESOURCE_GROUP=Advanti
export CLUSTER_NAME=PySparkSnowflake
export CLUSTER_TYPE=spark
export COMPONENT_VERSION=Spark=2.3
export PASSWORD='P95th&X%*8Q5RClX&4'
export LOCATION='East US'
export CLUSTER_VERSION=3.6
export AZURE_STORAGE_ACCOUNT=pysparksfstore
export AZURE_STORAGE_CONTAINER=pysparksfstore-2020-05-29
export CLUSTER_SIZE_NODE=2

2020–06 Warning: At the time of writing this, PySpark 2.4, HDInsight 4.0, Python 3.7, Snowflake 2.12, Snowflake-Connector 2.7.1, and Snowflake JDBC 3.12.5, don’t play together very well, so it’s best to use the versions I mentioned.

I'm assuming you already have a resource group created; if you don't, just follow the steps in this article.

First you need to create a storage account:

az storage account create \
--name $AZURE_STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--https-only true \
--kind StorageV2 \
--location "$LOCATION" \
--sku Standard_LRS

Next, you’ll need the storage key value, so execute this command:

az storage account keys list \
--account-name $AZURE_STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP

You will see a response like this:

[
{
"keyName": "key1",
"permissions": "Full",
"value": "0kUstqMu68mmGQOxNhUORAYto63iUEDWrN5bEjXtAKJR8eXAZVLnCbH+81H0hrdu8FSme3W95yqQ=="
},
{
"keyName": "key2",
"permissions": "Full",
"value": "cv72JQJfyE8Ol+hHT8NMIN83wruz7kVcJR2X1b/ZAIuq6eI7GOBdzJ7IN2YIrH40OBaMntYRoKIEA=="
}
]

Then take the first key1 value and place that into an environment variable like so:

export AZURE_STORAGE_KEY=0kUstqMu68mmGQOxNhUORAYto63iUEDWrN5bEjXtAKJR8eXAZVLnCbH+81H0hrdu8FSme3W95yqQ==
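
If you'd rather not copy/paste from the JSON, the same thing can be done in one shot with a JMESPath query; a small sketch, assuming key1 is the first entry as in the response above:

# Grab the first key directly into the environment variable
export AZURE_STORAGE_KEY=$(az storage account keys list \
  --account-name $AZURE_STORAGE_ACCOUNT \
  --resource-group $RESOURCE_GROUP \
  --query '[0].value' \
  --output tsv)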

Then create your storage container. If you don't already have one, go ahead and create it like this:

az storage container create \
--name $AZURE_STORAGE_CONTAINER \
--account-key $AZURE_STORAGE_KEY \
--account-name $AZURE_STORAGE_ACCOUNT

You’ll see a response like this:

{
"created": true
}

Okay, last step!

az hdinsight create \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--type $CLUSTER_TYPE \
--component-version $COMPONENT_VERSION \
--http-password "$PASSWORD" \
--http-user admin \
--location "$LOCATION" \
--workernode-count $CLUSTER_SIZE_NODE \
--ssh-password "$PASSWORD" \
--ssh-user sshuser \
--storage-account $AZURE_STORAGE_ACCOUNT \
--storage-account-key $AZURE_STORAGE_KEY \
--storage-container $AZURE_STORAGE_CONTAINER \
--version $CLUSTER_VERSION

^ That will create an HDInsight PySpark cluster for you! Congratulations if you got this far, high five 🙌.

Now, as I mentioned above for the Wizard, wait 15–25 minutes; once it's done, you'll see it in the Azure management console. Go get some ☕️, you earned it.

This is the cluster I created in the HDInsight Cluster screen

Then if I dive in, I can see my cluster at a glance here…

Step 3: Install Python Packages

PySpark, out of the box, is preloaded with the core libraries needed for basic data exploration, such as pandas, and most of the "how to" tutorials will assume that's all you need. However, for me that's rarely the case: I need certain packages installed from PyPI, my own internal non-public packages, or the latest version of a particular library. So the first thing we'll cover is getting those packages loaded.

First things first: I recommend that you follow these instructions, which walk you through, step by step, how to create a script action. Bear in mind that this process of creating a script for the cluster to run is required even if you're using GCP or AWS. The key point is that you'll need a storage account and container that are accessible by the HDInsight PySpark instance.

I created the shell script below, which creates a new Python virtual environment (again, this is a best practice), updates conda to the latest version if we haven't already, installs and updates pip to the latest version, and then installs the packages I need from PyPI. Also, if you're wondering, you don't have to use Anaconda; you can install a Python 3.7 virtual environment directly, but I won't cover that option here. Instead, I'll be using Conda.
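
The actual script lives at the gist linked below; as a rough sketch of what a bootstrap script like this boils down to (the Anaconda path, environment name, and Python version here are assumptions, so adjust them to your image), it looks roughly like:

#!/usr/bin/env bash
set -euo pipefail

# Create an isolated conda env; '|| true' keeps the script re-runnable if it already exists
/usr/bin/anaconda/bin/conda create --name py35new python=3.5 --yes || true

# Bring conda itself up to date
/usr/bin/anaconda/bin/conda update --name base conda --yes

# Upgrade pip inside the new env, then install the extra PyPI packages
/usr/bin/anaconda/envs/py35new/bin/pip install --upgrade pip
/usr/bin/anaconda/envs/py35new/bin/pip install seaborn pyarrow plotly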

Alright, let's run our custom script. We'll head into the HDInsight cluster console, click on "Script Actions", and then on "Submit new":

Remember that if you point the custom script action at the gist URI, it must be the raw version; otherwise the cluster downloads HTML it can't process 😜. Here's what it should look like:

https://gist.githubusercontent.com/daefresh/96ea0c442576e513cde3979c584fb680/raw/c2816ea0bd9a5c0303746071db7512c8976ee2dc/azure_pyspark_bootstrap.sh

We’ll create a script action that looks like this:

This ☝️ usually takes 10–15 minutes, so chill out until your screen refreshes and you get a success!

Troubleshooting

If you run into any issues, I recommend you try to execute the script directly on one of the nodes in the cluster; you can start off with one of the head nodes. You can do this by going into "SSH + Cluster login" and copying the ssh command to the clipboard.

Getting the SSH command

Paste that command into your terminal.

Once you connect, you'll want to test your shell script step by step to determine what may not be working as expected. You'll see in the screenshot below the steps I took: download the shell script using wget, make it executable using chmod, and then execute the script.

Executing your shell script via SSH
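
If it helps, the equivalent commands look roughly like this; the URL is the raw gist above, and whether you need sudo depends on what your script actually touches:

# Download the raw script, make it executable, and run it
wget https://gist.githubusercontent.com/daefresh/96ea0c442576e513cde3979c584fb680/raw/c2816ea0bd9a5c0303746071db7512c8976ee2dc/azure_pyspark_bootstrap.sh
chmod +x azure_pyspark_bootstrap.sh
sudo ./azure_pyspark_bootstrap.sh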

Should you run into an error, you'll need to start problem-solving with Google or some good old-fashioned trial and error 😆. Oh, and make sure that your shell script is re-runnable!

2020–06 Warning: There's a known bug for Anaconda versions 4.7.11, 4.7.12, and 4.8.0. If you see your script actions hanging at "Collecting package metadata (repodata.json): ...working..." and failing with "Python script has been killed due to timeout after waiting 3600 secs", you can download this script and run it as a script action on all nodes to fix the issue.

Step 4: Quick Health Check

Now there's one more step we need to take to make sure that everything is looking good: head into Ambari by clicking "Ambari home". You'll find this tile on the cluster overview page.

HDInsight PySpark Cluster Tile

This is your management console. Provided that everything is "all green", you can proceed to the next step! If you see an error, make sure that you restart all of the services, and also ensure that you didn't set up a 1-node or tiny 2-node cluster, because you may run into issues from not having enough horsepower.

Apache Ambari screenshot

As a side note, I've noticed that my small cluster is quite chatty when idling, so be aware of that overhead. Here's a screenshot of htop on one of my cluster nodes; you can see it's busy even when there are no user requests.

Busy 🐝 cluster node while idling

Step 5: Jupyter Smoke Test

First, fire up a Jupyter Notebook. You'll see a tile on your HDInsight cluster that looks like the screenshot below, and the URL will again use the same name as your cluster; in my case it's: https://pysparkcluster.azurehdinsight.net/jupyter/tree

Then use the username/password you provided earlier when you created the cluster:

Next, create a new PySpark3 notebook and give it a friendly name you'll remember.

Then simply type the below into the first cell and run it:

import sys
print(sys.version)

My expectation is that Spark starts and it prints the Python version installed. If this works, you know your cluster was launched successfully!

Spark is now available!

Conclusion

㊗️ You have created a cluster, installed external packages, and connected using Jupyter. In the next part, we'll get you connected to Snowflake and start issuing some queries!

About

Have any questions? Reach out to me on LinkedIn.

Doug Eisenstein

Doug is an entrepreneur, tech leader, writer, and innovator 🔥.