Installing DataProc with Jupyter Notebook on Google Cloud Platform

Published in SYNTIO, Feb 14, 2022

Dataproc is a powerful data processing engine that can be used in many ways, including as a humble data lab.

When building the foundations of a data layer platform, it’s good to consider creating a small data exploration area. In that area we can use simple storage and a cost-effective processing platform for basic tasks, such as data checks performed on pre-automated datasets. This blog entry gives a short introduction to setting up Google’s Cloud Dataproc for that purpose.

ENABLE API

First, enable the Cloud Dataproc API for your project in the Google Cloud console.

INSTALL SDK

If it is not already installed, install the Google Cloud SDK by following the official installation instructions.

CREATE BUCKET

If a bucket does not already exist, create one using the CREATE BUCKET button in the Cloud Storage browser and define:

A unique bucket name: e.g. datalayer-storage
A storage class: e.g. Multi-regional
A location where the bucket data will be stored: e.g. EU

And now: the installation steps

CREATING CLUSTER USING GCP SHELL

Run the following command:

When prompted, enter the following values for region/zone: europe-west4/europe-west4-b
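A minimal sketch of such a command, assembled in Python so it can be printed and pasted into Cloud Shell. It assumes the cluster is named `dataproc` (inferred from the `dataproc-m` host name used later in this post) and reuses the example bucket name from above. The initialization action URI is a hypothetical placeholder; substitute the script you actually use to install Jupyter. Passing `--region` and `--zone` explicitly should make the region/zone prompt unnecessary.

```python
import shlex


def build_create_cluster_command(cluster, region, zone, bucket, init_action_uri):
    """Return the argument list for a `gcloud dataproc clusters create` call."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--zone={zone}",
        f"--bucket={bucket}",
        f"--initialization-actions={init_action_uri}",
    ]


command = build_create_cluster_command(
    cluster="dataproc",            # assumed cluster name
    region="europe-west4",
    zone="europe-west4-b",
    bucket="datalayer-storage",    # example bucket name from this post
    init_action_uri="gs://YOUR_BUCKET/PATH/TO/jupyter-init.sh",  # hypothetical placeholder
)

# Print the command as a single shell-safe line.
print(shlex.join(command))
```

The printed line can be run as-is in Cloud Shell once the placeholder URI is replaced.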

OR

USING GCP WEB UI (CONSOLE)

Navigate to the Dataproc cluster list and click the CREATE CLUSTER button.

Define the cluster name, region and zone, then click Advanced options.

Define the bucket and the initialization action.

Click Create.

Post-installation setup: connecting to Jupyter notebook

Create an SSH tunnel

In the GCP shell, run the following commands with proper values for the PROJECT, HOSTNAME and ZONE attributes (Windows style).
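As a sketch, the tunnel can be opened with `gcloud compute ssh` and SSH dynamic port forwarding. The project ID `my-project` and the local port `1080` are assumptions for illustration; `dataproc-m` is the master host name that appears in the notebook URL.

```python
import shlex


def build_ssh_tunnel_command(project, hostname, zone, port=1080):
    """Return the argument list for an SSH tunnel to the cluster master node.

    Everything after "--" is passed to ssh itself:
    -D opens a local SOCKS proxy on `port`, -N skips running a remote command.
    """
    return [
        "gcloud", "compute", "ssh", hostname,
        f"--project={project}",
        f"--zone={zone}",
        "--", "-D", str(port), "-N",
    ]


tunnel_command = build_ssh_tunnel_command(
    project="my-project",   # hypothetical project ID
    hostname="dataproc-m",
    zone="europe-west4-b",
)

print(shlex.join(tunnel_command))
# prints: gcloud compute ssh dataproc-m --project=my-project --zone=europe-west4-b -- -D 1080 -N
```

The command keeps running while the tunnel is open, so leave that shell session active.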

Configure your browser

In a command prompt, run the following commands with proper values for the PROJECT, HOSTNAME and ZONE attributes (Windows style).
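One way to do this, sketched under assumptions: start Chrome with a SOCKS proxy pointing at the local end of the SSH tunnel and with a separate profile directory. The Chrome install path below and port `1080` are assumptions; adjust both to your machine and to the port used for the tunnel.

```python
import os
import tempfile

# Assumed default install location on Windows; change it if Chrome lives elsewhere.
CHROME_PATH = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"


def build_browser_command(hostname, port=1080, chrome_path=CHROME_PATH):
    """Return the argument list that starts Chrome through the SOCKS proxy."""
    # A dedicated profile directory keeps the proxy settings out of the normal profile.
    profile_dir = os.path.join(tempfile.gettempdir(), hostname)
    return [
        chrome_path,
        f"--proxy-server=socks5://localhost:{port}",
        f"--user-data-dir={profile_dir}",
    ]


browser_command = build_browser_command("dataproc-m")
print(browser_command)
```

The list can be passed to `subprocess.Popen(browser_command)` to launch the browser from Python, or its elements can be typed into a command prompt by hand.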

Connect to the notebook interface using the link http://dataproc-m:8123

Jupyter notebook short reference

Creating a new notebook

Sample pyspark code

Connect to a CSV file (CSV_PATH in the format gs://STORAGE_NAME/FILE_PATH)
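A minimal sketch of reading such a file, assuming the notebook kernel already provides a `spark` session object and that the file has a header row. The helper names and the file `samples/data.csv` are hypothetical; only the bucket name comes from this post.

```python
def gcs_path(storage_name, file_path):
    """Build a URI in the gs://STORAGE_NAME/FILE_PATH format."""
    return f"gs://{storage_name}/{file_path.lstrip('/')}"


def load_csv(spark, csv_path):
    """Read a CSV file with a header row into a Spark DataFrame."""
    return (
        spark.read
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # let Spark guess the column types
        .csv(csv_path)
    )


CSV_PATH = gcs_path("datalayer-storage", "samples/data.csv")  # hypothetical file
print(CSV_PATH)
```

In a notebook cell, `df = load_csv(spark, CSV_PATH)` followed by `df.printSchema()` would show the detected columns.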

Executing a query and showing results

Example:
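A minimal sketch of running a query, assuming a DataFrame loaded from the CSV file and a `spark` session object provided by the notebook kernel. The view name `sample_data` and the helper names are hypothetical.

```python
def preview_query(view_name, limit=10):
    """Build a simple SQL statement that returns the first rows of a view."""
    return f"SELECT * FROM {view_name} LIMIT {limit}"


def run_query(spark, df, view_name, sql):
    """Register df as a temporary view, then run sql against it."""
    df.createOrReplaceTempView(view_name)
    return spark.sql(sql)


query = preview_query("sample_data", limit=5)
print(query)
```

In a notebook cell, `run_query(spark, df, "sample_data", query).show()` would print the result as a table.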

Have fun! :)

Originally published at https://www.syntio.net, April 11, 2019.

Written by Syntio
