Migrating Data Processing Hadoop Workloads to GCP

Anant Damle
Google Cloud - Community
5 min read · Apr 30, 2020

Written by Anant Damle and Varun Dhussa

Spending a sleepless night getting a pipeline to work on a Hadoop cluster sounds familiar, doesn’t it? The role of a data engineer involves a host of activities, from development to DevOps, and working with the ever-growing Hadoop ecosystem makes for a challenging job. Developing a pipeline feels like the easier part when you spend far more time getting the elephant moving!

Now that you are considering moving your infrastructure to Google Cloud, you come across Dataproc, Google Cloud’s managed open source data and analytics processing platform. Dataproc allows you to run Apache Spark and Hadoop jobs seamlessly in the cloud. It sounds good and you’re intrigued, but migrations can be tedious and scary… Let’s break it up and simplify it!

Creating and configuring a cluster

Creating a Hadoop cluster using Dataproc is as simple as issuing a command. A cluster can use one of the Dataproc-provided image versions or a custom image based on one of the provided image versions. An image version is a stable and supported package of the operating system, big data components and Google Cloud connectors.

You can use the latest image version (the default) or select an image that matches your existing platform. You can also select additional supported optional components, such as Druid and Presto. If you rely on components such as Livy that are not yet available as optional components, you can make use of initialization actions, which run an executable or a script on each node of the cluster immediately after cluster setup.

You have the option to create a standard cluster with one master node or a high availability cluster with three master nodes. The high availability option is ideal for always-on, long-running clusters. If you want to get started right away, fire up your Cloud Shell and run the following command:

# create a cluster in us-central1
gcloud dataproc clusters create my-first-cluster \
--region=us-central1
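
Building on that basic command, the sketch below shows how the optional components, initialization actions, and three-master high availability mentioned above come together. The image version, component choice, and script location are assumptions to adapt to your environment:

# create a high-availability cluster with an optional component and a
# custom initialization action (the gs:// script path is hypothetical)
gcloud dataproc clusters create my-ha-cluster \
--region=us-central1 \
--num-masters=3 \
--image-version=1.5-debian10 \
--optional-components=PRESTO \
--initialization-actions=gs://my-bucket/scripts/install-livy.sh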

Storage

After you have created a cluster and added any optional components to it, you need to address the elephant in the room. You have a lot of data and you keep adding more every day. How do you move that to Google Cloud? Some of these data transfer options can guide you on the journey.

You can create a conventional HDFS setup by attaching Persistent Disks (PD) to the nodes in your cluster.

A popular and recommended alternative is to use Google Cloud Storage as the persistence layer. Because Cloud Storage is a highly scalable and durable storage system that provides strongly consistent operations, it reduces the management overhead involved in HDFS. To switch from HDFS to Cloud Storage, simply change the file path prefix from hdfs:// to gs://. You can also use table formats such as Delta Lake and Iceberg for schema support, ACID transactions, and data versioning. Choose the right storage technology for the cluster, because no solution works for all use cases.
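
As a small sketch of what that switch looks like in practice, the commands below copy a dataset out of HDFS and list it back through the gs:// prefix. Run them on a cluster node (for example over SSH); the bucket name and path are placeholders:

# copy a dataset from the cluster's HDFS into a Cloud Storage bucket
hadoop distcp hdfs:///user/data/events gs://my-data-bucket/events

# the same data is now addressable by jobs via the gs:// prefix
hadoop fs -ls gs://my-data-bucket/events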

Regardless of your choice of storage, you should use a central Hive Metastore for your Dataproc clusters, for example by using MySQL on Cloud SQL as the backing database.
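
One common pattern, sketched below under a few assumptions (the instance name and tier are placeholders, and the initialization action and metadata key follow the public cloud-sql-proxy sample, which you should copy to your own bucket and verify), is to back the metastore with a Cloud SQL MySQL instance and point each cluster at it when it is created:

# create a MySQL instance to hold the Hive metastore
gcloud sql instances create hive-metastore \
--database-version=MYSQL_5_7 --tier=db-n1-standard-1 --region=us-central1

# create a cluster that reaches it through the Cloud SQL proxy init action
gcloud dataproc clusters create analytics-cluster \
--region=us-central1 \
--initialization-actions=gs://my-bucket/cloud-sql-proxy/cloud-sql-proxy.sh \
--metadata=hive-metastore-instance=my-project:us-central1:hive-metastore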

In time, consider migrating to BigQuery, a serverless, petabyte-scale data warehouse. To simplify this migration, Dataproc comes bundled with the BigQuery connector and the BigQuery Spark connector.
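
To give a taste of that, here is a minimal sketch of a PySpark job that reads a public BigQuery table through the Spark BigQuery connector and submits it to the cluster created earlier. The script name is arbitrary, and the optional --jars line is only needed if your image version does not already bundle the connector:

# write a tiny PySpark script that reads a public BigQuery table
cat > bq_read.py <<'EOF'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read-example").getOrCreate()
# read a public dataset via the Spark BigQuery connector
df = spark.read.format("bigquery").load("bigquery-public-data.samples.shakespeare")
df.groupBy("corpus").count().show(10)
EOF

# submit it to the cluster created earlier
gcloud dataproc jobs submit pyspark bq_read.py \
--cluster=my-first-cluster --region=us-central1
# if your image does not bundle the Spark BigQuery connector, also pass:
#   --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar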

Data processing

(Figure: Typical architecture for running a job on Dataproc)

After your cluster is ready and your data is accessible, you want to process and analyze data. You need to run complex pipelines joining data from multiple sources to generate reports. You might also need to support data analysts, who run ad-hoc SQL queries on the data using components such as Hive, SparkSQL or Presto.

Dataproc makes it very simple to submit jobs to the cluster using the jobs.submit method of the Dataproc API, without the need to set up perimeter (edge-node) instances. It uses the central Identity and Access Management (IAM) layer to authenticate and authorize requests, allowing you to run jobs securely on private clusters. Try the following command in Cloud Shell to run a program on the cluster you created earlier that calculates the value of Pi:

# Submit a Spark job using Dataproc API
# Outputs: Pi is roughly 3.1419864714198646
gcloud dataproc jobs submit spark \
--cluster my-first-cluster --region us-central1 \
--class org.apache.spark.examples.SparkPi \
--jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
-- 1000

For more complex workloads involving multiple interdependent jobs that form a Directed Acyclic Graph (DAG), you can use Dataproc Workflow Templates. Alternatively, you can drive the Dataproc API from Cloud Composer, Google Cloud’s managed version of Apache Airflow.
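
As a sketch, a single-step workflow template backed by a managed (ephemeral) cluster might look like the following; the template and cluster names are placeholders:

# define a workflow template backed by a managed, ephemeral cluster
gcloud dataproc workflow-templates create pi-workflow --region=us-central1
gcloud dataproc workflow-templates set-managed-cluster pi-workflow \
--region=us-central1 --cluster-name=pi-cluster

# add a Spark step, then run the whole DAG
gcloud dataproc workflow-templates add-job spark \
--workflow-template=pi-workflow --step-id=compute-pi --region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
gcloud dataproc workflow-templates instantiate pi-workflow --region=us-central1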

Hive, Presto and SparkSQL engines are often used for ad-hoc querying by data analysts and for serving reporting dashboards. Read more about using them with your Dataproc cluster.
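
For a quick taste of ad-hoc querying, you can push a Hive statement straight through the jobs API without logging into the cluster; the statement here is just a trivial example:

# run an ad-hoc Hive query against the cluster created earlier
gcloud dataproc jobs submit hive \
--cluster=my-first-cluster --region=us-central1 \
--execute="SHOW DATABASES;"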

Monitoring and security

The central Identity and Access Management layer enables easy access to a job’s output and an audit trail through Cloud Logging. To manage and monitor a cluster’s resources and applications, enable the Dataproc Component Gateway, which gives you secure access to the Hadoop web interfaces. Use Cloud Monitoring or Cloud Profiler for advanced use cases such as health and performance monitoring, generating insights, or building dashboards.
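
Component Gateway is a cluster-creation flag, so turning it on is a one-liner; the sketch below shows it on a fresh cluster:

# create a cluster with Component Gateway enabled, so the YARN, HDFS and
# Spark web UIs are reachable securely from the Cloud Console
gcloud dataproc clusters create ui-enabled-cluster \
--region=us-central1 --enable-component-gateway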

Optimizations

Now that you have migrated your data platform, you want to optimize it to reduce operational overhead and cost. By using Cloud Storage as your storage layer and an external Hive Metastore, you have separated storage and compute, which lets you focus on your workload rather than the cluster. Add to that Dataproc’s 90-second cluster creation time, central IAM, and logging, and you can create on-demand ephemeral clusters that are deleted as soon as your pipeline finishes.
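
One lightweight way to get that ephemeral behaviour is Dataproc’s scheduled deletion, which removes a cluster after it has been idle for a given period; the 30-minute timeout below is an arbitrary choice:

# delete the cluster automatically after 30 minutes without running jobs
gcloud dataproc clusters create ephemeral-cluster \
--region=us-central1 --max-idle=30m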

To help ensure security and isolation, run each cluster with a separate service account that has access to only those resources required by the job. This service account is also helpful for monitoring, auditing, and debugging. Since there is no interdependence between workflows, you can easily test and roll out different versions of Hadoop components and your service. As the clusters are not always on, you save on cost. You can further reduce costs by using preemptible instances to run your tasks, as they are priced lower. Bear in mind not to use them for HDFS-bound tasks, because they can be preempted at any time.
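
A sketch of both ideas together, assuming a pre-created service account (the account name and project are placeholders) and two preemptible workers on top of the regular ones:

# run the cluster as a dedicated, least-privilege service account
# and add preemptible workers for cheap, stateless compute
gcloud dataproc clusters create pipeline-cluster \
--region=us-central1 \
--service-account=pipeline-runner@my-project.iam.gserviceaccount.com \
--num-workers=2 --num-preemptible-workers=2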

A common overhead of running a cluster is resizing it to handle bursty loads, such as stream processing or ad-hoc data-analysis queries. Dataproc provides both manual scaling and autoscaling, which uses YARN metrics to control the number of worker nodes.
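
Autoscaling is driven by a policy that you import once and attach to clusters; the sketch below uses illustrative values that you would tune for your workload, and attaches the policy to the cluster created earlier:

# a minimal autoscaling policy (values are illustrative)
cat > autoscaling-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

# import the policy and attach it to the existing cluster
gcloud dataproc autoscaling-policies import scale-on-yarn \
--source=autoscaling-policy.yaml --region=us-central1
gcloud dataproc clusters update my-first-cluster \
--region=us-central1 --autoscaling-policy=scale-on-yarn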

Enough talking; I’ll leave you to try one of the quickstarts for yourself…

