How to Integrate PySpark, Snowflake, Azure, and Jupyter: Part 1

Doug Eisenstein
4 min read · Jun 5, 2020


Prepare to start: Snowflake + PySpark = 🌞

Purpose

By the end of this three-part series, you’ll be able to launch a Spark cluster running in Azure on HDInsight and query live data from Snowflake using the Snowflake Connector with pushdown capability, all from a Jupyter notebook running a Python 3.5 virtual environment. We’ll explore the TPC-H dataset along the way, and we’ll do it all in about 90 minutes.

Background

I created this documentation out of need. There’s a good series on the Snowflake Blog, made about two years ago in 2018, but it doesn’t cover Azure HDInsight or Python, as it’s tilted toward Scala. The online Snowflake documentation is useful too, but it’s more of a detailed reference than a “how to” instructional series. Both are good references, so do read through them if you get stuck; I hope this gets you further along the way.

You can review the entire blog series here: Part One > Part Two > Part Three.

Audience

The intended audience is engineers, enthusiasts, and hackers who may already be using Snowflake but haven’t yet integrated it with PySpark and would like to do so.

Assumptions

This tutorial assumes:

  • You have a Snowflake account. If you don’t, click here and get one set up; it’s pretty easy, and in my experience their sales reps know what they’re talking about and are “real” engineers, which makes working with them a positive experience.
  • You run Snowflake on your preferred cloud provider; we’ll work with Azure. If you don’t have an account, you can get set up with some free credits, so go ahead and do that now.
  • You can work with some of Snowflake’s sample data, like their weather sample, or easily import your very own data; in this case, I’ll be using the TPC-H data (see the sketch after this list).
  • You have access to your Azure portal and the CLI. I won’t be showing infrastructure-as-code, though that should be a requirement before you go to production; you should never be creating production clusters through wizards or even the CLI.
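As a quick sanity check that your account can see the TPC-H data, here’s a minimal sketch using the snowflake-connector-python package; the account, user, password, and warehouse values are placeholders you’ll need to replace.

```python
import snowflake.connector

# Placeholder credentials; substitute your own account details.
conn = snowflake.connector.connect(
    account="<your_account>",
    user="<your_user>",
    password="<your_password>",
    warehouse="<your_warehouse>",
)

cur = conn.cursor()
# TPC-H ships with Snowflake as the SNOWFLAKE_SAMPLE_DATA database.
cur.execute("SELECT COUNT(*) FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS")
print(cur.fetchone())  # ~1.5 million rows at scale factor 1
conn.close()
```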

Azure

Snowflake runs on GCP, AWS, and Azure; we’ll be using Azure for this tutorial. Side note: in my experience, all of these cloud providers are great, though Azure is a little easier to get set up than AWS or GCP. That’s not to say I’d avoid GCP or AWS out of fan-boy loyalty to one of them, and I’ll certainly continue to implement client Snowflake installations on those too, but I wanted to give credit where it’s due: nice work, Azure.

We’ll be using HDInsight. Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. You can use open-source frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, R, and more.

Snowflake

Snowflake has gained tremendous traction in the last few years; I’m seeing over 55% of our clients now running Snowflake. This doesn’t mean they have moved their entire data warehouse over to it, but they have bought credits and are dipping their toes into the water, whether by using it as an enterprise data warehouse or by moving their SQL-ready analytical applications from other, less scalable relational databases to Snowflake.

In my experience, Snowflake has a few key advantages:

  • It’s serverless, so you only get charged for your query.
  • It’s multi-cloud, so it decouples 🔌 your data warehouse from AWS/GCP/Azure.
  • It requires less DBA overhead than traditional data warehouses.
  • The data is micro-partitioned and stored in S3/GCS/Blob storage.
  • There’s integration with PySpark, with the ability to push down queries into the database (sketched just below).
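On that last point, pushdown is enabled by default in recent versions of the Snowflake Spark connector, but it can be toggled per session. From PySpark this goes through the JVM bridge, roughly as in this sketch (which assumes an existing SparkSession named spark):

```python
# Sketch: toggling query pushdown on the Snowflake Spark connector from
# PySpark. The utility class lives on the JVM side, so it's reached via
# spark._jvm; 'spark' is assumed to be a running SparkSession.
utils = spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils

utils.enablePushdownSession(spark._jsparkSession)    # turn pushdown on
# utils.disablePushdownSession(spark._jsparkSession) # ...or back off
```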

PySpark

Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. Apache Spark in Azure HDInsight is the Microsoft implementation of Apache Spark in the cloud. HDInsight makes it easier to create and configure a Spark cluster in Azure. Spark clusters in HDInsight are compatible with Azure Storage and Azure Data Lake Storage. So you can use HDInsight Spark clusters to process your data stored in Azure.

One of the key integrations, IMHO, is the integration with PySpark, and in particular the pushdown capability (good article here too). We need balance: sometimes it’s best to have the database handle certain operations, like simple aggregations, while PySpark/Python handles more complex transformations, like time-series resampling, and especially machine learning and artificial intelligence. In this article I’ll take you through a few simple steps to get this all going; feel free to comment if there is anything you’d like to dig into further.
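To make that concrete, here’s a minimal sketch of the kind of thing we’ll build up to: reading a TPC-H table from Snowflake into a PySpark DataFrame, with a simple aggregation the connector can push down into Snowflake. The connection values are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-pushdown-demo").getOrCreate()

# The Snowflake Spark connector's data source name.
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

# Placeholder connection options; substitute your own account details.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "SNOWFLAKE_SAMPLE_DATA",
    "sfSchema": "TPCH_SF1",
    "sfWarehouse": "<warehouse>",
}

# Load the TPC-H ORDERS table as a DataFrame. Nothing is pulled yet;
# Spark evaluates lazily.
orders = (
    spark.read.format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("dbtable", "ORDERS")
    .load()
)

# With pushdown enabled, this groupBy/count is translated to SQL and
# executed inside Snowflake; only the small result comes back to Spark.
orders.groupBy("O_ORDERSTATUS").count().show()
```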

Jupyter

Jupyter Notebook is a very popular tool for exploration and analytical decision-making, used by data scientists and hackers like me. It enables you to edit, run, and share Python code as a “notebook.” It’s 100% browser-based, and it lets you build up, step by step, a process that shapes the data how you need it for analysis, exploration, and reporting. There are cool libraries out there like papermill from Netflix, which enables data pipelines to consume the data outputted from a notebook; take a look at that, but it’s not part of this article.
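For the curious, here’s roughly what driving a notebook with papermill looks like; both notebook names and the parameter are hypothetical.

```python
import papermill as pm

# Execute a notebook headlessly, injecting values into the cell tagged
# 'parameters'. The file names and 'row_limit' here are made up.
pm.execute_notebook(
    "explore_tpch.ipynb",            # hypothetical input notebook
    "explore_tpch_output.ipynb",     # executed copy, with outputs
    parameters={"row_limit": 1000},  # hypothetical parameter
)
```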

Conclusion

Alright, let’s start moving through the process; this will be fun!

About

Have any questions? Reach out to me on LinkedIn.


Doug Eisenstein

Doug is an entrepreneur, tech leader, writer, and innovator 🔥.