Hadoop Ecosystem In Google Cloud Platform (GCP)

Basics of Hadoop Ecosystem

Hadoop Ecosystem can be reviewed as a suite which encloses a number of services (ingesting, storing, analyzing and maintaining) inside it. It is generally considered as a platform or a framework which solves Big Data issues.

Use of Dataproc and Dataflow for Hadoop Ecosystem in GCP

Dataproc for Hadoop:

Dataproc is considered as the Managed Hadoop for the cloud. By using Dataproc in GCP, we can run Apache Spark and Apache Hadoop clusters on Google Cloud Platform in a powerful and cost-effective way. Dataproc is a managed Spark and Hadoop service that allows the user to create clusters quickly, and then hand off the cluster management to the service.

Cloud Dataproc is best for the environments which depend on specific components of the Apache big data ecosystem like Tools/packages, Pipelines, Skill sets of existing resources.

Dataflow for Hadoop:

There is another service provided by Google for Hadoop, which is Cloud Dataflow. Dataflow is generally used when the user wants to perform Stream processing (ETL) and Batch processing (ETL) both. In Dataproc, we can only perform Batch processing.

Also, Dataflow is used when the user wants a preprocessed workload for machine learning with Cloud ML Engine.

Dataflow uses Apache Beam and supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes.

Cost effective way of using GCP for Hadoop

As, in GCP, we can use Google Cloud Storage instead of HDFS (Hadoop Distributed File System), there is no need to keep the clusters activated after the job is completed. We can delete the clusters once the job of the cluster is completed; so by that, we can save a huge amount of expense on the clusters.

In GCP, we can use a type of VM instance known as preemptible VM instance by which we can use to create the worker nodes of the cluster which will run for a maximum of 24 hours and will automatically be deleted. As a result of the fault tolerance feature, the Dataproc will introduce another preemptible instance of similar feature and machine type. So since the preemptible instances are very cheap, we, as a user, can save loads of money.

Advantages of migrating Hadoop to GCP

Built-in tools for Hadoop:

GCP’s Cloud Dataproc is a managed by Hadoop and Spark environment. Users can use Cloud Dataproc to run most of their existing jobs with minimal changes, so the users don’t need to alter all of the Hadoop tools they already know.

Automatically managed hardware and configuration:

When a user runs Hadoop on GCP, all the worries related to physical hardware are handled by the GCP. The user just needs to specify the configuration of the cluster, and Cloud Dataproc allocates resources required. Later on, the user can scale the cluster at any point of time.

Version simplification management:

One of the most complex parts of managing a Hadoop cluster is keeping the open source tools up to date and working together. When users use Cloud Dataproc, much of that is managed and handled for them by Cloud Dataproc versioning. So it saves their time and extra efforts.

Flexible job configuration:

In a typical Hadoop setup, many purposes are served by a single cluster. But after moving to GCP, users can focus on individual tasks, creating as many clusters as they need. This removes the complexity of maintaining a single cluster with growing dependencies and software configuration interactions.

Monitoring the process:

Users can use Stackdriver Monitoring to understand the performance and health of their Cloud Dataproc clusters and examine HDFS, YARN, and Cloud Dataproc job and operation metrics.

Cloud Dataproc Cluster resource metrics are automatically enabled on the clusters of Cloud Dataproc, and the users can use this monitoring to see these metrics.

Planning of migration from an on-premise Hadoop solution to GCP

Migrating from an on-premises Hadoop solution to GCP needs a shift in approach. Generally, the on-premises Hadoop system contains a monolithic cluster which supports many workloads across multiple business areas because of which the system becomes more complex with time and will require administrators to make everything work, in the monolithic cluster. When users move their Hadoop system to GCP, they will be able to reduce the complexity of the administration. But to get the most efficient processing in GCP with the minimal cost, the users have to give emphasis on the structure of their data and jobs.

So the simplest solution for the users will be to use a persistent Cloud Dataproc cluster to replicate their on-premises setup. However, there are some limitations to this approach:

  • It is recommended by Google to store the data in Cloud Storage instead of keeping the data in a persistent HDFS cluster using Cloud Dataproc, as it is more expensive.
  • Also, keeping data in HDFS cluster limits the ability of users to use their data with other GCP products.
  • For particular use cases, it is more efficient and economical to replace some of the open-source-based tools with the related GCP services.
  • It is easier to manage the targeted clusters that serve individual jobs or job areas than using a single persistent Cloud Dataproc cluster for the jobs.

From the above discussion, we can conclude that shifting the on-premise Hadoop Ecosystem to Google Cloud Platform can save loads of money, precious time, and reduce the complexity of the process. And Cloud Dataproc easily integrates the typical Hadoop tools with other Google Cloud Platform (GCP) services, giving the users a powerful and complete platform for data processing, analytics, and machine learning.

Related blogs:

What is the Virtual Data Center in Google Cloud Platform (GCP)?

Basic Concepts of Networking in Google Cloud Platform

Cloud Archives

Tudip’s Cloud Portfolio