Google Cloud Dataproc: Launch a Hadoop-Hive-Spark Cluster in Google Cloud Platform (GCP)

Harsh Muniwala · Published in Petabytz · 5 min read · Sep 10, 2019

What is Spark?

Apache Spark is a fast, general-purpose engine for large-scale data processing. You can write code in Scala or Python and it will automatically parallelize itself on top of Hadoop. At its core it runs a generalized, in-memory form of map/reduce.
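As a quick illustration, here is how a Spark job is typically launched on a YARN-backed cluster. The example class and jar path below are the ones shipped with Spark on Dataproc images; treat the path as an assumption if you run this elsewhere.

# Run the bundled SparkPi example on YARN; Spark splits the work into
# 1000 tasks and YARN spreads them across the cluster's workers.
spark-submit \
  --master yarn \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/jars/spark-examples.jar 1000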

What is Hadoop and HDFS?

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is really a common library, called Hadoop Common, and a processing framework, called Hadoop MapReduce, that sit on top of a distributed file system called HDFS.
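A minimal sketch of working with HDFS from the command line (the file and directory names here are hypothetical):

# Copy a local file into HDFS, list the directory, and read it back.
# Behind the scenes the file is split into blocks and replicated
# across the cluster's worker nodes.
hdfs dfs -mkdir -p /data
hdfs dfs -put input.txt /data/
hdfs dfs -ls /data
hdfs dfs -cat /data/input.txt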

What is Yarn?

The Hadoop Distributed File System (HDFS) distributes file system data across a set of worker nodes. Job scheduling and cluster resource management, on the other hand, are handled by a separate system called YARN (Yet Another Resource Negotiator).
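You can watch YARN at work from the master node's shell:

# List the applications YARN is currently scheduling and the worker
# nodes (NodeManagers) whose resources it manages.
yarn application -list
yarn node -list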

What is Hive?

Hadoop alone doesn’t know much about data structure; it deals with raw text files. Most humans work with SQL, so the Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed HDFS storage using SQL. It lets you create and query a SQL schema on top of text files, which can be in various formats, including the usual CSV or JSON.
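For instance, here is a sketch of defining and querying a schema over CSV files in HDFS; the table name, columns, and path are made up for the example.

# Create a SQL schema over existing CSV files, then query them with
# plain SQL; Hive compiles the query into jobs that run on the cluster.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales';
SELECT id, SUM(amount) FROM sales GROUP BY id;"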

What is DataProc?

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don’t need them. With less time and money spent on administration, you can focus on your jobs and your data.
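For example, tearing a cluster down when you are finished is a single command (the cluster name and region below are placeholders):

# Dataproc clusters bill while they run; delete one when the work is
# done and recreate it later in a couple of minutes.
gcloud dataproc clusters delete my-cluster --region us-west1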

Launch Hadoop-Hive-Spark in GCP:

Launching a Hadoop cluster by hand can be a daunting task: it can demand more than a day per node to get a working cluster, or a day to set up a local VM sandbox, and such sandboxes are highly susceptible to frequent crashes and corruption. We can leverage the cloud platform to solve this problem and make the developer’s life a little easier.

Google Cloud Dataproc is Google’s version of the Hadoop ecosystem. It includes the Hadoop Distributed File System (HDFS) and the MapReduce processing framework. The Google Cloud Dataproc system also includes a number of applications such as Hive, Mahout, Pig, Spark, and Hue built on top of Hadoop.

In this blog, we will see how to set up Dataproc on GCP. There are two ways to create a Dataproc cluster: one is using the UI wizard and the second is using the gcloud command-line tool, which drives the same REST API.

Dataproc Concept

Steps to Set Up Google Dataproc:

  1. Click on the Menu and navigate to Dataproc under the BIG DATA section
  2. Click the Create cluster button
  3. Give the cluster a name
  4. (Optional) Create a Cloud Storage staging bucket to stage files, such as Hadoop jars, between client machines and the cluster. If not specified, a default bucket is used
  5. (Optional) Create a network for the Dataproc cluster, which will be used by its Compute Engine VMs. If not specified, the default network will be chosen for you
  6. Set the Machine type and Primary disk size for your Master node(s) (for a dev config => disk size: 100 GB, RAM: 18 GB)
  7. Set the Machine type for your Worker nodes and specify how many nodes you require (for a dev config => 2 worker nodes, disk size: 100 GB, RAM: 12 GB)
  8. Click on the Create button to create the cluster when you are done
Creating a network for the Dataproc cluster

Creating a Dataproc cluster using the web UI

Cluster creation in Dataproc

Creating a Dataproc Cluster using the gcloud Command:

gcloud dataproc clusters create hadoop-hive-spark-dev-local-cluster \
  --region us-west1 \
  --subnet default \
  --zone us-west1-a \
  --master-machine-type custom-6-18432 \
  --master-boot-disk-size 100 \
  --num-workers 2 \
  --worker-machine-type custom-4-12288 \
  --worker-boot-disk-size 100 \
  --num-worker-local-ssds 1 \
  --image-version 1.3 \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' \
  --project hdp-multi-node-cluster-207321
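Once the cluster is up, you can smoke-test it by submitting a job through Dataproc itself. The jar path below is where Dataproc images ship the Spark examples; adjust it if your image differs.

# Submit the SparkPi example as a Dataproc job.
gcloud dataproc jobs submit spark \
  --cluster hadoop-hive-spark-dev-local-cluster \
  --region us-west1 \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000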

Accessing the Name Node and Resource Manager UIs:

  1. http://YOUR-CLUSTER-IP-ADDRESS:8088/cluster (replace YOUR-CLUSTER-IP-ADDRESS with the external IP of the cluster’s master VM instance)
Resource Manager

  2. http://YOUR-CLUSTER-IP-ADDRESS:9870/dfshealth.html#tab-namenode (port 9870 applies to Hadoop 3.x images; on Hadoop 2.x images such as the 1.3 image used above, the NameNode UI listens on port 50070 instead)

Name Node UI
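Note that the default network does not expose these ports to the internet. Rather than opening firewall rules, you can tunnel to the master node over SSH; this sketch assumes the default Dataproc naming, where the master VM is the cluster name with an "-m" suffix.

# Forward the Resource Manager and NameNode UI ports to localhost.
gcloud compute ssh hadoop-hive-spark-dev-local-cluster-m \
  --zone us-west1-a \
  -- -L 8088:localhost:8088 -L 9870:localhost:9870
# Then browse http://localhost:8088/cluster on your workstation.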

Conclusion:

Cloud Dataproc takes the pain out of standing up a Hadoop-Hive-Spark stack: instead of spending days configuring nodes or nursing a fragile local sandbox, you can create a cluster in minutes through the web UI or a single gcloud command, run your jobs, and delete the cluster when you are done, paying only for what you use. With less time and money spent on administration, you can focus on your jobs and your data.

Limitations of Dataproc:

  1. No choice of a specific version of the Hadoop/Hive/Spark stack; component versions are fixed by the Dataproc image version you select
  2. You cannot pause/stop a Dataproc cluster; you can only delete and recreate it
  3. No UI for managing cluster-specific configuration, such as Ambari or Cloudera Manager
