Installing Dataproc with Jupyter Notebook on Google Cloud Platform
Dataproc is a powerful data processing engine that can be utilized in many ways, but also as a humble data lab.
When building the foundations of a data layer platform, it is worth setting aside a small data exploration area. In that area we can use simple storage and a cost-effective processing platform for basic tasks, such as basic data checks performed on pre-automated datasets. This blog entry gives a short introduction to setting that up with Google's Cloud Dataproc.
ENABLE API
First, enable the Dataproc API for your GCP project.
INSTALL SDK
If the Google Cloud SDK is not already installed, install it by following the official installation instructions.
CREATE BUCKET
If the bucket does not already exist, create one using the CREATE BUCKET button in the Cloud Storage console and define:
- A unique bucket name: e.g. datalayer-storage
- A storage class: e.g. Multi-Regional
- A location where the bucket data will be stored: e.g. EU
And now, the installation steps
CREATING CLUSTER USING GCP SHELL
Run the following command:
When prompted, enter the following value for region/zone: europe-west4/europe-west4-b
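As a minimal sketch, the command can be assembled along the following lines. The cluster name dataproc is inferred from the dataproc-m hostname used later in this post, and the bucket is the example one from the CREATE BUCKET step. The initialization action path is an assumption, since the post does not name the one it used. Passing --region and --zone explicitly means gcloud should not need to prompt for them. The script only prints the command, so it can be reviewed before being run in the GCP shell.

```python
import shlex

# Values taken from this post; adjust them to your own project.
CLUSTER_NAME = "dataproc"  # the master node then becomes "dataproc-m"
BUCKET = "datalayer-storage"
REGION = "europe-west4"
ZONE = "europe-west4-b"
# Assumption: the post does not name the initialization action it used.
INIT_ACTION = "gs://dataproc-initialization-actions/jupyter/jupyter.sh"


def build_create_cluster_command(name, bucket, region, zone, init_action):
    """Return the gcloud argument list that creates the cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        "--region", region,
        "--zone", zone,
        "--bucket", bucket,
        "--initialization-actions", init_action,
    ]


command = build_create_cluster_command(
    CLUSTER_NAME, BUCKET, REGION, ZONE, INIT_ACTION
)
# Print the command so it can be reviewed and pasted into the GCP shell.
print(shlex.join(command))
```

Building the command as a list keeps each flag next to its value, which makes it easier to swap in a different bucket or initialization action.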
OR
USING GCP WEB UI (CONSOLE)
Navigate to the Dataproc cluster list and click the CREATE CLUSTER button.
Define the cluster name, region and zone, then click Advanced options.
Define the bucket and the initialization action.
Click Create.
Post-installation setup: connecting to Jupyter notebook
Create an SSH tunnel
- In the GCP shell, run the following commands with the proper values for the PROJECT, HOSTNAME and ZONE attributes (Windows style).
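A minimal sketch of such a tunnel command, assuming a hypothetical project ID my-project, the master node dataproc-m, and local port 1080 for the SOCKS proxy (the port is an assumption; any free local port should work). The script only prints the command for review.

```python
import shlex

PROJECT = "my-project"   # hypothetical project ID, replace with your own
HOSTNAME = "dataproc-m"  # master node of the cluster
ZONE = "europe-west4-b"
SOCKS_PORT = 1080        # assumed local port for the SOCKS proxy


def build_tunnel_command(project, hostname, zone, port):
    """Return the gcloud argument list that opens the SSH tunnel."""
    return [
        "gcloud", "compute", "ssh", hostname,
        f"--project={project}",
        f"--zone={zone}",
        "--",             # everything after this is passed to ssh itself
        "-D", str(port),  # dynamic (SOCKS) port forwarding
        "-N",             # do not open a remote shell
    ]


tunnel_command = build_tunnel_command(PROJECT, HOSTNAME, ZONE, SOCKS_PORT)
print(shlex.join(tunnel_command))
```

The tunnel stays open for as long as the command keeps running, so it needs its own shell window.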
Configure your browser
- In the command prompt, run the following commands with the proper values for the PROJECT, HOSTNAME and ZONE attributes (Windows style).
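One way to do this, sketched under assumptions, is to start a separate Chrome session that sends its traffic through the SOCKS proxy opened by the SSH tunnel. The Chrome install path, port 1080 and the profile directory are assumptions. --proxy-server and --user-data-dir are standard Chrome command-line switches. The script prints a command line for the Windows command prompt.

```python
# Assumed default Chrome location on Windows; adjust if installed elsewhere.
CHROME_PATH = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
HOSTNAME = "dataproc-m"
SOCKS_PORT = 1080  # must match the port used for the SSH tunnel


def build_browser_command(chrome_path, hostname, port):
    """Return a Windows command line that starts Chrome behind the proxy."""
    flags = [
        f"--proxy-server=socks5://localhost:{port}",
        # Separate profile so the proxy does not affect normal browsing.
        f"--user-data-dir=%Temp%\\{hostname}",
    ]
    # The executable path contains spaces, so it is wrapped in double quotes.
    return f'"{chrome_path}" ' + " ".join(flags)


browser_command = build_browser_command(CHROME_PATH, HOSTNAME, SOCKS_PORT)
print(browser_command)
```

Using a dedicated profile directory keeps the proxy settings out of your everyday browser session.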
Connect to the notebook interface using the link http://dataproc-m:8123
Jupyter notebook short reference
Creating a new notebook
Sample PySpark code
- Connect to a CSV file (CSV_PATH in the format gs://STORAGE_NAME/FILE_PATH)
- Execute a query and show the results
Example:
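A minimal sketch of those two steps, assuming the example bucket datalayer-storage and a hypothetical file samples/data.csv with a header row. The view name dataset, the application name and the row limit are also illustrative choices.

```python
def build_csv_path(storage_name, file_path):
    """Build a Cloud Storage URI in the gs://STORAGE_NAME/FILE_PATH format."""
    return f"gs://{storage_name}/{file_path.lstrip('/')}"


def show_sample(csv_path, limit=10):
    """Load the CSV into a DataFrame, register it as a view and query it."""
    # Imported here so the path helper also works where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("datalab-sample").getOrCreate()
    # Assumes the first row holds column names; types are inferred from the data.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.createOrReplaceTempView("dataset")
    spark.sql(f"SELECT * FROM dataset LIMIT {int(limit)}").show()


# Hypothetical file inside the example bucket from this post.
CSV_PATH = build_csv_path("datalayer-storage", "samples/data.csv")
print(CSV_PATH)
```

In a notebook cell on the cluster, calling show_sample(CSV_PATH) would then read the file and print the first rows.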
Have fun! :)
Originally published at https://www.syntio.net on April 11, 2019.