Google Cloud Dataproc: Launch a Hadoop-Hive-Spark Cluster in Google Cloud Platform (GCP)

Harsh Muniwala · Published in Petabytz · 5 min read · Sep 10, 2019

What is Spark?

Apache Spark is a fast, general-purpose engine for large-scale data processing. You can write code in Scala or Python and it will automatically parallelize itself on top of Hadoop. At its core it runs a generalized, in-memory form of map/reduce.
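As a quick illustration, here is how a Spark job is typically launched on a YARN-backed cluster. The example class and jar path below are the ones shipped with Spark on Dataproc images; treat the path as an assumption if you run this elsewhere.

# Run the bundled SparkPi example on YARN; Spark splits the work into
# 1000 tasks and YARN spreads them across the cluster's workers.
spark-submit \
  --master yarn \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/jars/spark-examples.jar 1000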

What is Hadoop and HDFS?

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is really a common library, called Hadoop Common, and a processing framework, called Hadoop MapReduce, that sit on top of a distributed file system called HDFS.
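A minimal sketch of working with HDFS from the command line (the file and directory names here are hypothetical):

# Copy a local file into HDFS, list the directory, and read it back.
# Behind the scenes the file is split into blocks and replicated
# across the cluster's worker nodes.
hdfs dfs -mkdir -p /data
hdfs dfs -put input.txt /data/
hdfs dfs -ls /data
hdfs dfs -cat /data/input.txt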

What is Yarn?

The Hadoop Distributed File System (HDFS) distributes file system data across a set of worker nodes. Job scheduling and cluster resource management, on the other hand, are handled by a separate system called YARN (Yet Another Resource Negotiator).
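You can watch YARN at work from the master node's shell:

# List the applications YARN is currently scheduling and the worker
# nodes (NodeManagers) whose resources it manages.
yarn application -list
yarn node -list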

What is Hive?

Hadoop alone doesn’t know much about data structure; it deals with raw text files. Most humans work with SQL, so the Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed HDFS storage using SQL. It lets you create and query a SQL schema on top of text files, which can be in various formats, including the usual CSV or JSON.
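For instance, here is a sketch of defining and querying a schema over CSV files in HDFS; the table name, columns, and path are made up for the example.

# Create a SQL schema over existing CSV files, then query them with
# plain SQL; Hive compiles the query into jobs that run on the cluster.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales';
SELECT id, SUM(amount) FROM sales GROUP BY id;"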

What is DataProc?

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don’t need them. With less time and money spent on administration, you can focus on your jobs and your data.
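For example, tearing a cluster down when you are finished is a single command (the cluster name and region below are placeholders):

# Dataproc clusters bill while they run; delete one when the work is
# done and recreate it later in a couple of minutes.
gcloud dataproc clusters delete my-cluster --region us-west1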

Launch Hadoop-Hive-Spark in GCP:

Launching a Hadoop cluster by hand can be a daunting task: it can demand more than a day per node to get a working cluster, or a day to set up a local VM sandbox, and such sandboxes are highly susceptible to frequent crashes and corruption. We can leverage the cloud platform to solve this problem and make the developer’s life a little easier.

Google Cloud Dataproc is Google’s version of the Hadoop ecosystem. It includes the Hadoop Distributed File System (HDFS) and the MapReduce processing framework. The Google Cloud Dataproc system also includes a number of applications such as Hive, Mahout, Pig, Spark, and Hue built on top of Hadoop.

In this blog, we will see how to set up Dataproc on GCP. There are two ways to create a Dataproc cluster: one is using the UI wizard and the second is using the gcloud command-line tool, which drives the same REST API.

Dataproc Concept

Steps to Set Up Google Dataproc:

  1. Click on the Menu and navigate to Dataproc under the BIG DATA section
  2. Click the Create cluster button
  3. Give the cluster a name
  4. (Optional) Create a Cloud Storage staging bucket to stage files, such as Hadoop jars, between client machines and the cluster. If not specified, a default bucket is used
  5. (Optional) Create a network for the Dataproc cluster, which will be used by its Compute Engine VMs. If not specified, the default network will be chosen for you
  6. Set the Machine type and Primary disk size for your Master node(s) (for a dev config => disk size: 100 GB, RAM: 18 GB)
  7. Set the Machine type for your Worker nodes and specify how many nodes you require (for a dev config => 2 worker nodes, disk size: 100 GB, RAM: 12 GB)
  8. Click on the Create button to create the cluster when you are done
Creating a network for the Dataproc cluster

Creating a Dataproc cluster using the web UI

Cluster creation in Dataproc

Creating a Dataproc Cluster using the gcloud Command:

gcloud dataproc clusters create hadoop-hive-spark-dev-local-cluster \
  --region us-west1 \
  --subnet default \
  --zone us-west1-a \
  --master-machine-type custom-6-18432 \
  --master-boot-disk-size 100 \
  --num-workers 2 \
  --worker-machine-type custom-4-12288 \
  --worker-boot-disk-size 100 \
  --num-worker-local-ssds 1 \
  --image-version 1.3 \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' \
  --project hdp-multi-node-cluster-207321
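Once the cluster is up, you can smoke-test it by submitting a job through Dataproc itself. The jar path below is where Dataproc images ship the Spark examples; adjust it if your image differs.

# Submit the SparkPi example as a Dataproc job.
gcloud dataproc jobs submit spark \
  --cluster hadoop-hive-spark-dev-local-cluster \
  --region us-west1 \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000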

Accessing the Name Node and Resource Manager UIs:

  1. http://YOUR-CLUSTER-IP-ADDRESS:8088/cluster (replace YOUR-CLUSTER-IP-ADDRESS with the external IP of the cluster’s master VM instance)
Resource Manager

  2. http://YOUR-CLUSTER-IP-ADDRESS:9870/dfshealth.html#tab-namenode (port 9870 applies to Hadoop 3.x images; on Hadoop 2.x images such as the 1.3 image used above, the NameNode UI listens on port 50070 instead)

Name Node UI
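Note that the default network does not expose these ports to the internet. Rather than opening firewall rules, you can tunnel to the master node over SSH; this sketch assumes the default Dataproc naming, where the master VM is the cluster name with an "-m" suffix.

# Forward the Resource Manager and NameNode UI ports to localhost.
gcloud compute ssh hadoop-hive-spark-dev-local-cluster-m \
  --zone us-west1-a \
  -- -L 8088:localhost:8088 -L 9870:localhost:9870
# Then browse http://localhost:8088/cluster on your workstation.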

Conclusion:

Cloud Dataproc takes the pain out of standing up a Hadoop-Hive-Spark stack: instead of spending days configuring nodes or nursing a fragile local sandbox, you can create a cluster in minutes through the web UI or a single gcloud command, run your jobs, and delete the cluster when you are done, paying only for what you use. With less time and money spent on administration, you can focus on your jobs and your data.

Limitations of Dataproc:

  1. No choice of a specific version of the Hadoop/Hive/Spark stack; component versions are fixed by the Dataproc image version you select
  2. You cannot pause/stop a Dataproc cluster; you can only delete and recreate it
  3. No UI for managing cluster-specific configuration, such as Ambari or Cloudera Manager
