A Beginner’s Guide to Dataproc

Drishti Gupta
Google Cloud - Community
7 min read · Nov 26, 2023

What is Dataproc

Dataproc is a Google-managed, cloud-based service for running big data processing, machine learning, and analytic workloads on the Google Cloud Platform. It provides a simple, unified interface for managing clusters of machines, running jobs, and storing data.

Coupled with other GCP data analysis tools such as Cloud Storage, BigQuery, and Vertex AI, Dataproc makes it easy to analyze large amounts of data quickly.

Dataproc Use Cases

Many companies and market leaders today use Dataproc to harness its data processing and analytics capabilities.

In 2021, Flipkart undertook the journey of migrating its infrastructure to Google Cloud Platform. It has been using Dataproc to build a robust processing platform capable of handling millions of messages per second for real-time analysis, as well as processing petabytes of data in both real-time and batch modes.

Wayfair Group also centralises its data, performs operational reporting, and derives actionable insights with the help of Dataproc, BigQuery, and Google Cloud Storage, at high performance and low cost.

Financial giant PayPal has also been using Dataproc and BigQuery to run its analytical workloads and Spark jobs.

An Introduction to Hadoop

Dataproc emerged as a response to the growing demand for efficient and scalable data processing solutions, particularly in the era of big data. While based on the same technology, Dataproc offers a managed, cloud-based alternative to traditional Hadoop.

Hadoop is an open-source framework for distributed storage and processing of large data sets using a cluster of commodity hardware. It is designed to scale from single servers to thousands of machines, offering a high degree of fault tolerance. Hadoop is based on HDFS, a distributed file system, and the MapReduce programming model, which allows for parallel processing of data across a distributed cluster.

In this article we will explore the basics of Dataproc and create a cluster to run Hive jobs.

An Introduction to Apache Hive

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. Hive was developed at Facebook to reduce the work of writing Java MapReduce programs. Built on top of Apache Hadoop, Hive supports storage on HDFS as well as on third-party systems like Amazon S3 and Google Cloud Storage.

We chose Hive for this demonstration because of its easy-to-understand, SQL-like syntax.
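
As a small illustration of that syntax, here is a minimal, purely hypothetical HiveQL query (the employees table and its columns exist only for the sake of the example):

-- Hypothetical table, used only to show Hive's SQL-like syntax
SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;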

So let’s get started with some hands-on!

Prerequisites

1. A GCP account and access to the Google Cloud Console.

2. The Dataproc API enabled, to interact with Dataproc services.

3. The Dataproc Metastore API enabled, to allow access to the Hive metastore (both APIs can also be enabled with gcloud, as sketched below).
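
If you prefer the command line, here is a minimal sketch for enabling both APIs, assuming your project is already selected in the active gcloud configuration:

# Enable the Dataproc and Dataproc Metastore APIs for the current project
gcloud services enable dataproc.googleapis.com
gcloud services enable metastore.googleapis.com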

Part 1: Creating a Cluster

Step 1

The first step is to create a Dataproc cluster. Search for ‘Dataproc’ in the search bar of the GCP console and click on Create a Dataproc Cluster.

Here you’ll get two options: a) a cluster on Compute Engine, or b) a cluster on GKE. Let’s go with a Compute Engine cluster for simplicity.

Fill in the cluster name, the region you want it created in, and the cluster type.

Clusters in Dataproc can be of 3 types (their gcloud equivalents are sketched after the screenshot below):

  • Standard: 1 master node and N worker nodes (where you specify what N should be).
  • Single Node: All the tasks of the master (managing cluster configuration, coordinating resource allocation) as well as the tasks of a worker (actually executing tasks and processing data) are performed by a single node. This mode is suitable for small, independent jobs.
  • High Availability: Dataproc allows you to create 3 master nodes for fault tolerance. If 1 master node fails, the remaining two can continue to manage the cluster. Apache ZooKeeper helps coordinate and elect an active master.
Setting up your Dataproc Cluster
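
For reference, the cluster type can also be chosen from the command line. A minimal sketch, where the cluster names and region are placeholders you would replace:

# Single Node cluster: one node acts as both master and worker
gcloud dataproc clusters create my-single-node-cluster --region=us-central1 --single-node

# High Availability cluster: three master nodes for fault tolerance
gcloud dataproc clusters create my-ha-cluster --region=us-central1 --num-masters=3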

Step 2

Set your node configurations in the ‘Configure nodes’ tab, if needed. For this example we will leave these at their defaults (the equivalent gcloud flags are sketched below).

Configuring Cluster Nodes
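
If you do want to customize the nodes from the command line, these are the kinds of flags involved. A sketch with placeholder machine types and sizes, not tuned recommendations:

# Example node configuration (values are placeholders, not recommendations)
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n2-standard-4 \
    --worker-machine-type=n2-standard-4 \
    --num-workers=2 \
    --worker-boot-disk-size=100GB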

Step 3

While setting up the cluster, you can also provide a custom image for it. This image describes the cluster configuration, components, and libraries to be pre-installed, specific to your jobs.

A simple customization script for such an image can look like this:

#!/bin/bash
# Update package list
sudo apt-get update
# Install Airflow
pip install apache-airflow
Adding a cluster image (optional)
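
If you upload such a script to a Cloud Storage bucket, one common way to apply it is as an initialization action when creating the cluster, rather than baking a full custom image. A sketch, where the bucket path and script name are placeholders:

# Run the uploaded script on each node at cluster creation time (path is hypothetical)
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/install-airflow.sh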

You can also specify cluster properties if needed. For example, enabling Hive dynamic partitioning:

hive:hive.exec.dynamic.partition=true
Setting Cluster Properties Example
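
On the command line, cluster properties are passed with the --properties flag in a file-prefix:key=value format. A sketch for the dynamic partitioning example above:

# Enable Hive dynamic partitioning as a cluster property
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties="hive:hive.exec.dynamic.partition=true"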

Metastore Modes in Hive

While in this article we use the default Embedded mode, there are 3 metastore modes in Hive:

Embedded: Both the metastore service and the metadata reside on the same cluster that is running your jobs, i.e. the lifecycle of the Hive metadata (table schemas) is the same as that of the cluster. (default)

Local: The metastore service runs on the same cluster but connects to an external database for persisting metadata.

External: Both the metastore service and storage are external to the cluster.

We can switch to Local or External mode by specifying the metastore properties:

hive:hive.metastore.uris=<metastore endpoint>
hive:hive.metastore.warehouse.dir=<hive-warehouse-directory>
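
For example, these properties can be supplied at cluster creation time through the --properties flag. A sketch, where the thrift endpoint and warehouse bucket are placeholders:

# Point the cluster at an external Hive metastore (endpoint and bucket are hypothetical)
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties="hive:hive.metastore.uris=thrift://metastore-host:9083,hive:hive.metastore.warehouse.dir=gs://my-warehouse-bucket/hive-warehouse"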

Step 4

You can also change the security settings in the Manage Security tab, modifying encryption, access, authentication, etc. Let’s leave these as default.

Customize security options according to your requirements

Step 5

Finally click on ‘Create’ to create the cluster.

CLI Equivalent for Creating a Dataproc Cluster using gcloud

You can also create a cluster with gcloud, in a terminal window or in Cloud Shell, using the following command and passing additional arguments for specific requirements.

gcloud dataproc clusters create my-cluster --region=us-central1

gcloud cluster create command with sample flags; an exhaustive list of flags is available in the gcloud reference documentation.

gcloud dataproc clusters create YOUR_CLUSTER_NAME \
    --region=REGION \
    --initialization-actions=CLOUD_STORAGE_URI \
    --master-boot-disk-size=MASTER_BOOT_DISK_SIZE \
    --node-group=NODE_GROUP

Part 2: Submitting a Job

Now that our cluster is created, let’s submit some simple jobs to it using HQL. Select your cluster and go to the SUBMIT JOB option at the top of the screen.

There are 2 ways in which we can submit jobs to the Dataproc cluster from the console:

1. Using Text

In the window pane that opens, set the Job type as Hive and the Query source type as ‘Query text’. Enter the query you want to run and submit the job.

Let us create a database first.

CREATE DATABASE School;
Running a Query using Text

2. Using a File

Now, let’s create a table, insert some data into it, and select it to see if it gives us the correct output.

Create your job file with a .hql extension and upload it to a bucket on GCS. For example:
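
A small job file might look like this; the table name, columns, and values are illustrative only:

-- school_job.hql (hypothetical example file)
USE School;

-- Create a simple table in the School database
CREATE TABLE IF NOT EXISTS students (
  id INT,
  name STRING,
  grade INT
);

-- Insert a few sample rows
INSERT INTO TABLE students VALUES
  (1, 'Asha', 9),
  (2, 'Rohan', 10);

-- Read the data back to verify the insert
SELECT * FROM students;

The file can then be copied to a bucket (the bucket name is a placeholder):

gsutil cp school_job.hql gs://my-bucket/school_job.hql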

This time, set the Job type as Hive and the Query source type as ‘Query file’. Select your .hql file from the tab and submit the job.

Logs for Job Run

We see from the logs that the database and table were successfully created and the data was correctly inserted into our database!
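
The same driver output can also be inspected from the command line. A sketch, where the cluster name, region, and job ID are placeholders:

# List recent jobs submitted to the cluster
gcloud dataproc jobs list --cluster=my-cluster --region=us-central1

# Stream the driver output of a specific job
gcloud dataproc jobs wait JOB_ID --region=us-central1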

CLI Equivalent of Submitting a Job on Dataproc using gcloud

You can also submit jobs on Dataproc using gcloud in a terminal window or in Cloud Shell. You can add the --cluster-labels flag to specify your cluster labels.

gcloud dataproc jobs submit job-command \
--cluster=cluster-name \
--region=region \
other dataproc-flags \
-- job-args
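
For the Hive jobs in this article, a concrete version of the command might look like this (the cluster name, region, and bucket path are placeholders):

# Submit an inline Hive query
gcloud dataproc jobs submit hive \
    --cluster=my-cluster \
    --region=us-central1 \
    --execute="CREATE DATABASE School;"

# Submit a Hive job from a .hql file stored in Cloud Storage
gcloud dataproc jobs submit hive \
    --cluster=my-cluster \
    --region=us-central1 \
    --file=gs://my-bucket/school_job.hql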

Conclusion

We learned the very basics of how to navigate Dataproc on Google Cloud; however, there is a lot more that Dataproc has to offer. From supporting various other frameworks (Spark, Pig, Cassandra, ZooKeeper) and running much more complex jobs, to configuring and using the Dataproc Metastore service, Dataproc offers a one-stop solution for all of one’s data analysis needs.

References and Further Reading

https://cloud.google.com/dataproc/?hl=en
