Apache Spark and Jupyter Notebooks made easy with Dataproc component gateway

Use the new Dataproc optional components and component gateway features to easily set up and use Jupyter Notebooks

Tahir Fayyaz
Google Cloud - Community
10 min read · Mar 12, 2020

Apache Spark and Jupyter Notebooks architecture on Google Cloud

As a long-time user and fan of Jupyter Notebooks, I am always looking for the best ways to set up and use notebooks, especially in the cloud. I believe Jupyter Notebooks are the perfect tool for learning, prototyping, and in some cases productionizing your data projects, as they allow you to run your code interactively and immediately see your results. They are also a great tool for collaboration, thanks to their heritage of being used and shared in scientific communities.

You might have used Jupyter notebooks with Python on your desktop in the past but struggled to handle very large datasets. With the many kernels now available, however, you can use Apache Spark for distributed processing of large-scale data in Jupyter while continuing to use your Python libraries in the same notebook.

However, getting an Apache Spark cluster set up with Jupyter Notebooks can be complicated, so in Part 1 of this new “Apache Spark and Jupyter Notebooks on Cloud Dataproc” series of posts I will show you how easy it is to get started thanks to new features like optional components and component gateway.

Create a Dataproc cluster with Spark and Jupyter

You can create a Cloud Dataproc cluster using the Google Cloud Console, gcloud CLI or Dataproc client libraries.

We will be using the gcloud CLI from the Cloud Shell, where gcloud is already installed (if you’re new to Google Cloud, see the Getting Started with Cloud Shell & gcloud codelab).

You can also use the gcloud CLI locally by installing the Cloud SDK.

To get started, once you are in the Cloud Shell or your terminal window, set the project ID where you will create your Dataproc cluster:

gcloud config set project <project-id>

Enable product APIs and IAM roles

Run this command to enable all the APIs required for the Apache Spark and Jupyter Notebooks on Cloud Dataproc series of posts:

gcloud services enable dataproc.googleapis.com \
compute.googleapis.com \
storage-component.googleapis.com \
bigquery.googleapis.com \
bigquerystorage.googleapis.com

If you are not the admin or do not have the correct permissions to enable APIs, ask the admin of your GCP organization or project to enable the APIs above.

They will also need to give you the correct Dataproc IAM roles and Google Cloud Storage IAM roles to create and use your Dataproc cluster.

Create a GCS bucket to be used by your Dataproc Cluster

Create a Google Cloud Storage bucket in the region closest to your data and give it a unique name. This will be used for the Dataproc cluster.

REGION=us-central1
BUCKET_NAME=<your-bucket-name>
gsutil mb -c standard -l ${REGION} gs://${BUCKET_NAME}

You should see the following output

Creating gs://<your-bucket-name>/...
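
If you prefer to create the bucket from code instead of gsutil, here is a minimal sketch using the google-cloud-storage Python library. The bucket name and region are placeholders, and it assumes the library is installed and your application default credentials point at the right project.

from google.cloud import storage

# Create the bucket in the same region you plan to use for the Dataproc cluster.
client = storage.Client()
bucket = client.create_bucket("your-bucket-name", location="us-central1")
print(f"Created bucket {bucket.name} in {bucket.location}")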

Create your Dataproc Cluster with Jupyter & Component Gateway

Set the environment variables for your cluster:

REGION=us-central1 
ZONE=us-central1-a
CLUSTER_NAME=spark-jupyter-<your-name>
BUCKET_NAME=<your-bucket-name>

Then run this gcloud command to create your cluster with all the components needed to work with Jupyter.

gcloud beta dataproc clusters create ${CLUSTER_NAME} \
--region=${REGION} \
--zone=${ZONE} \
--image-version=1.5 \
--master-machine-type=n1-standard-4 \
--worker-machine-type=n1-standard-4 \
--bucket=${BUCKET_NAME} \
--optional-components=ANACONDA,JUPYTER \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh

You should see the following output while your cluster is being created

Waiting on operation [projects/spark-jupyter-notebooks/regions/us-central1/operations/random-letters-numbers-abcd123456].
Waiting for cluster creation operation…

It should take 2 to 3 minutes to create your cluster, and once it is ready you will be able to access it from the Dataproc Cloud Console UI.

You should see the following output once the cluster is created:

Created [https://dataproc.googleapis.com/v1beta2/projects/project-id/regions/us-central1/clusters/spark-jupyter-your-name] Cluster placed in zone [us-central1-a].

Flags used in the gcloud dataproc clusters create command

Here is a breakdown of the flags used in the gcloud dataproc clusters create command:

--region=${REGION} 
--zone=${ZONE}

Specifies the region and zone where the cluster will be created. You can see the list of available regions here. The zone is optional unless you are using N2 machine types, in which case you must specify a zone.

--image-version=1.5

The image version to use in your cluster. You can see the list of available versions here.

--bucket=${BUCKET_NAME}

Specifies the Google Cloud Storage bucket you created earlier to use for the cluster. If you do not supply a GCS bucket, one will be created for you.

This is also where your notebooks will be saved even if you delete your cluster as the GCS bucket is not deleted.

--master-machine-type=n1-standard-4
--worker-machine-type=n1-standard-4

The machine types to use for your Dataproc cluster. You can see a list of available machine types here.

Note: Look out for a future post in this series with recommendations on which machine types to use and how to enable autoscaling.

--optional-components=ANACONDA,JUPYTER

Setting these values for optional components will install all the necessary libraries for Jupyter and Anaconda (which is required for Jupyter notebooks) on your cluster.

--enable-component-gateway

Enabling the component gateway creates an App Engine link using Apache Knox and Inverting Proxy, which gives easy, secure, and authenticated access to the Jupyter and JupyterLab web interfaces, meaning you no longer need to create SSH tunnels.

It will also create links for other tools on the cluster, including the YARN Resource Manager and Spark History Server, which are useful for seeing the performance of your jobs and cluster usage patterns.

--metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage' 
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh

Installs the latest versions of the Google Cloud BigQuery Python library and the Google Cloud Storage Python library. These will be used to perform various tasks when working with BigQuery and GCS in your notebooks.
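
For completeness, the same configuration can also be expressed with the Dataproc client libraries mentioned earlier. Below is a rough sketch using the google-cloud-dataproc Python library that mirrors the flags above; the project ID, cluster name, and bucket name are placeholders, and it assumes a recent library version where the component gateway setting (endpoint_config) is available in the v1 API.

from google.cloud import dataproc_v1

project_id = "your-project-id"            # placeholder
region = "us-central1"
cluster_name = "spark-jupyter-your-name"  # placeholder
bucket_name = "your-bucket-name"          # placeholder

# The Dataproc client must target the regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "config_bucket": bucket_name,
        "gce_cluster_config": {
            "zone_uri": f"{region}-a",
            "metadata": {"PIP_PACKAGES": "google-cloud-bigquery google-cloud-storage"},
        },
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        "software_config": {
            "image_version": "1.5",
            "optional_components": ["ANACONDA", "JUPYTER"],
        },
        "initialization_actions": [
            {"executable_file": f"gs://goog-dataproc-initialization-actions-{region}/python/pip-install.sh"}
        ],
        # Equivalent of --enable-component-gateway
        "endpoint_config": {"enable_http_port_access": True},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster ready: {operation.result().cluster_name}")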

Accessing Jupyter or JupyterLab web interfaces

Once the cluster is ready you can find the Component Gateway links to the Jupyter and JupyterLab web interfaces in the Google Cloud console for Dataproc by clicking on the cluster you created and going to the Web Interfaces tab.

Alternatively, you can get the links by running this gcloud command.

REGION=us-central1
CLUSTER_NAME=spark-jupyter-<your-name>
gcloud beta dataproc clusters describe ${CLUSTER_NAME} \
--region=${REGION}

This will show output that includes the links in the following format:

clusterName: spark-jupyter-<your-name>
clusterUuid: XXXX-1111-2222-3333-XXXXXX
config:
  configBucket: bucket-name
  endpointConfig:
    enableHttpPortAccess: true
    httpPorts:
      Jupyter: https://random-characters-dot-us-central1.dataproc.googleusercontent.com/jupyter/
      JupyterLab: https://random-characters-dot-us-central1.dataproc.googleusercontent.com/jupyter/lab/
...
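
If you want to pick up these links programmatically, for example from a script or another notebook, a small sketch using the google-cloud-dataproc Python library would look roughly like this (the project ID and cluster name are placeholders):

from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = client.get_cluster(
    request={
        "project_id": "your-project-id",
        "region": region,
        "cluster_name": "spark-jupyter-your-name",
    }
)

# httpPorts maps interface names (Jupyter, JupyterLab, and so on) to their URLs.
for name, url in cluster.config.endpoint_config.http_ports.items():
    print(name, url)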

You will notice that you have access to Jupyter, the classic notebook interface, and JupyterLab, which is described as the next-generation UI for Project Jupyter.

There are a lot of great new UI features in JupyterLab, so if you are new to using notebooks or looking for the latest improvements it is recommended to use JupyterLab, as it will eventually replace the classic Jupyter interface according to the official docs.

Python 3, PySpark, R, and Scala kernels

Based on the image version you selected when creating your Dataproc cluster you will have different kernels available:

  • Image version 1.3: Python 2 and PySpark
  • Image version 1.4: Python 3, PySpark (Python), R, and Spylon (Scala)
  • Image version Preview (1.5): Python 3, PySpark (Python), R, and Spylon (Scala)

You should use image version 1.4 or above so that you can make use of the Python 3 kernel to run PySpark code or the Spylon kernel to run Scala code.
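
If you want to double-check which kernels your cluster’s Jupyter installation exposes, you can run the following in a notebook cell once you are connected (a quick sanity check, not required for the rest of the walkthrough):

!jupyter kernelspec list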

Creating your first PySpark Jupyter Notebook

From the Launcher tab, click on the Python 3 notebook icon to create a notebook with a Python 3 kernel (not the PySpark kernel), which allows you to configure the SparkSession in the notebook and include the spark-bigquery-connector required to use the BigQuery Storage API.

Once your notebook opens, check the Scala version of your cluster in the first cell so you can include the correct version of the spark-bigquery-connector jar.

Input [1]:

!scala -version

Output [1]:

Create a Spark session and include the spark-bigquery-connector jar

Input [2]:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName('Jupyter BigQuery Storage') \
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
  .getOrCreate()

Create a Spark DataFrame by reading in data from a public BigQuery dataset. This makes use of the spark-bigquery-connector and BigQuery Storage API to load the data into the Spark cluster.

If your Scala version is 2.11, use the following jar:

gs://spark-lib/bigquery/spark-bigquery-latest.jar

If your Scala version is 2.12, use the following jar:

gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

We will create a Spark DataFrame and load data from the BigQuery public dataset for Wikipedia pageviews. You will notice that we are not running a query on the data, as we are using the spark-bigquery-connector to load the data into Spark, where the processing of the data will happen.

Input [3]:

table = "bigquery-public-data.wikipedia.pageviews_2020"df = spark.read \
.format("bigquery") \
.option("table", table) \
.load()
df.printSchema()

Output [3]:

Create a new aggregated Spark DataFrame and print the schema

Input [4]:

df_agg = df \
  .select('wiki', 'views') \
  .where("datehour = '2020-03-03'") \
  .groupBy('wiki') \
  .sum('views')

df_agg.printSchema()

Output [4]:

Run the aggregation using the .show() function on the DataFrame, which will start the Spark job to process the data and then show the output of the Spark DataFrame, limited to the first 20 rows.

Input [5]:

df_agg.show()

Output [5]:
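
As an optional variation that is not part of the original walkthrough, you can sort the aggregated DataFrame before showing it so that the wikis with the most views appear first:

df_agg.orderBy(df_agg['sum(views)'].desc()).show(10)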

You should now have your first Jupyter notebook up and running on your Dataproc cluster. Give your notebook a name and it will be auto-saved to the GCS bucket used when creating the cluster. You can check this using this gsutil command.

BUCKET_NAME=<your-bucket-name>
gsutil ls gs://${BUCKET_NAME}/notebooks/jupyter
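
You can also check from Python, for example in the same notebook, using the google-cloud-storage library that was installed on the cluster by the pip-install initialization action (the bucket name is a placeholder):

from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("your-bucket-name", prefix="notebooks/jupyter"):
    print(blob.name)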

Example notebooks for more use cases

The next posts in this series will feature Jupyter notebooks with common Apache Spark patterns for loading data, saving data, and plotting your data with various Google Cloud Platform products and open-source tools:

  • Spark and BigQuery Storage API
  • Spark and Google Cloud Storage
  • Spark and Apache Iceberg / DeltaLake
  • Plotting Spark DataFrames using Pandas

You can also access the upcoming example notebooks on the Cloud Dataproc GitHub repo.

Giving the cluster’s service account access to data

In the example above we are accessing a public dataset, but for your own use case you will most likely be accessing your company’s data with restricted access. The Jupyter notebook and Dataproc cluster will attempt to access data in Google Cloud Platform services using the service account of the underlying Google Compute Engine (GCE) VMs and not your own Google credentials.

You can find the service account of your cluster by running this command to describe the master VM in GCE, which has the same name as your Dataproc cluster followed by -m:

ZONE=us-central1-a 
CLUSTER_NAME=spark-jupyter-<your-name>
gcloud compute instances describe ${CLUSTER_NAME}-m \
--zone=${ZONE}

This will give a long list of attributes including the service account and scopes as shown in this example output.

serviceAccounts:
- email: <random-number>-compute@developer.gserviceaccount.com
  scopes:
  - https://www.googleapis.com/auth/bigquery
  - https://www.googleapis.com/auth/bigtable.admin.table
  - https://www.googleapis.com/auth/bigtable.data
  - https://www.googleapis.com/auth/cloud.useraccounts.readonly
  - https://www.googleapis.com/auth/devstorage.full_control
  - https://www.googleapis.com/auth/devstorage.read_write
  - https://www.googleapis.com/auth/logging.write

Alternatively you can view the service account in the Google Cloud Console by going to the VM Instances tab in your Dataproc cluster and clicking on the master VM instance.

Once in the VM page scroll to the bottom and you will see the service account for the VM. This is the same service account for all VM instances in your cluster.

You should then give the service account the correct BigQuery IAM roles and GCS IAM roles to access the BigQuery datasets or GCS buckets you need.
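
As an illustration of the GCS side, here is a rough sketch of granting the cluster’s service account read access to one of your buckets with the google-cloud-storage Python library. The bucket name and service account email are placeholders; for BigQuery datasets you would instead grant roles such as roles/bigquery.dataViewer through the console, the bq tool, or the BigQuery client library.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-data-bucket")

# Fetch the current IAM policy, append a binding for the cluster's service
# account, and write the policy back.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"serviceAccount:<random-number>-compute@developer.gserviceaccount.com"},
    }
)
bucket.set_iam_policy(policy)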

For more details on providing the correct access read this solution to Help secure the pipeline from your data lake to your data warehouse.

Deleting your Dataproc cluster

Once you have finished all of your work within the Jupyter Notebook and all Spark jobs have finished processing, it is recommended to delete the Dataproc cluster, which can be done via the Cloud Console or using this gcloud command:

REGION=us-central1 
CLUSTER_NAME=spark-jupyter-<your-name>
gcloud beta dataproc clusters delete ${CLUSTER_NAME} \
--region=${REGION}
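
Or, if you are managing the cluster from code, a minimal sketch of the same step with the google-cloud-dataproc Python library (project ID and cluster name are placeholders):

from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = client.delete_cluster(
    request={
        "project_id": "your-project-id",
        "region": region,
        "cluster_name": "spark-jupyter-your-name",
    }
)
operation.result()  # blocks until the cluster has been deleted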

As mentioned before you can always delete and recreate your cluster and all your notebooks will still be saved in the Google Cloud Storage bucket which is not deleted when you delete your Dataproc cluster.

What’s Next

  • Look out for the next post in the series, which will cover using the spark-bigquery-connector in a Jupyter Notebook in more depth.
  • Follow me here on Medium (@tfayyaz) and on Twitter (tfayyaz) to hear about the latest Dataproc updates and to share feedback.
  • Ask any questions in the comments or on Stack Overflow under the google-cloud-dataproc tag.
  • Have fun working with Spark and Jupyter Notebooks on Dataproc.

Tahir Fayyaz
Google Cloud - Community

Google Cloud Developer Advocate — Data Lake Lead. Writing about Apache Spark, Jupyter, Cloud Dataproc, Cloud Composer and BigQuery Storage.