Migrating HDFS Data from On-Premises to Google Cloud Platform

Samadhan Kadam
Published in Petabytz · Jun 20, 2019

This guide gives an overview of how to move your on-premises Apache Hadoop system to Google Cloud Platform (GCP). It describes a migration process that not only moves your Hadoop work to GCP, but also lets you adapt that work to take advantage of a Hadoop system optimized for cloud computing. It also introduces some basic concepts you need to understand in order to translate your Hadoop configuration to GCP.

Proof of Concept
We started the process with a proof of concept (POC) in which we assessed how our existing infrastructure maps onto the services offered by Google Cloud Platform, and also planned for items on our future roadmap.

Key areas covered in POC:

· Data Migration

· HDFS

· HDFS to GCP

What is Data Migration?

Data migration is the process of transferring data between data storage systems, data formats or computer systems. A data migration project is done for numerous reasons, which include replacing or upgrading servers or storage equipment, moving data to third-party cloud providers, website consolidation, infrastructure maintenance, application or database migration, software upgrades, company mergers or data center relocation.

What is Hadoop Distributed File System (HDFS)?

The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

HDFS is a key part of many Hadoop ecosystem technologies, as it provides a reliable means for managing pools of big data and supporting related big data analytics applications.
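To make the NameNode/DataNode split concrete, here is a minimal sketch of browsing HDFS from Python over WebHDFS, using the third-party hdfs (hdfscli) package. The host name, port and path below are hypothetical placeholders rather than values from this article, and the sketch assumes WebHDFS is enabled on the NameNode:

```python
# Minimal WebHDFS sketch using the third-party "hdfs" (hdfscli) package.
# The host, port and paths are hypothetical examples.
from hdfs import InsecureClient

# The client only talks to the NameNode; the NameNode holds the metadata
# and tells clients which DataNodes store the actual blocks.
client = InsecureClient("http://namenode.example.com:9870", user="hdfs")

# List a directory and print basic metadata for each entry.
for name in client.list("/user/analytics"):
    info = client.status(f"/user/analytics/{name}")
    print(name, info["type"], info["length"])
```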

How can we migrate HDFS data from on-premises to GCP?

Here are the recommended steps for migrating your workflows to GCP:

  1. Move your data first. Move your data into Cloud Storage buckets, and start small: use backup or archived data to minimize the impact on your existing Hadoop system (see the sketch after this list).
  2. Experiment. Use a subset of your data to test and experiment, and build a small-scale proof of concept for each of your jobs. Try new approaches to working with your data and adjust to GCP and cloud-computing paradigms.
  3. Think in terms of specialized, ephemeral clusters. Use the smallest clusters you can, scoped to single jobs or small groups of closely related jobs. Create clusters each time you need them for a job and delete them when you're done.
  4. Use GCP tools wherever appropriate.
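As a rough illustration of step 1, the usual tool for bulk-copying HDFS data into Cloud Storage is Hadoop's DistCp, run against a gs:// destination via the Cloud Storage connector. The snippet below simply shells out to hadoop distcp from Python; the source path, bucket name and connector installation are assumptions made for the example, not details from this article:

```python
# Sketch: bulk-copy an HDFS directory into a Cloud Storage bucket with DistCp.
# Assumes the Cloud Storage connector is installed on the cluster and that the
# bucket and paths below (all hypothetical) exist and are accessible.
import subprocess

SRC = "hdfs:///user/analytics/archive/2019/"        # start small: archived data
DST = "gs://example-migration-bucket/archive/2019/"

# DistCp runs as a MapReduce job, so the copy scales with the cluster
# instead of being bottlenecked on a single machine.
subprocess.run(["hadoop", "distcp", "-update", SRC, DST], check=True)
```

Copying archived data first keeps load off the production NameNode and gives you a low-risk data set for the experiments in step 2.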

Hadoop in the Cloud

Our Hadoop infrastructure — what it was, and what it is now — will be the topic of future blog posts. I’ll just cover the basics here of what we had, and what we built on GCP.

For simplicity, this diagram omits JournalNodes, ZooKeeper and Cloudera management roles. The important points are that this production cluster:

· Was shared between development teams

· Ran on physical nodes which did not scale with load

· Launched jobs from gateway Virtual Machines

· Stored all data on a federated HDFS (4 HA NameNode pairs)

· Has run continuously (except for upgrades) since 2009

HDFS is without question the most scalable on-premises filesystem, but it has drawbacks compared to cloud-native object storage (S3, GCS): if you destroy your instances, you lose the data (unless you also do careful Persistent Disk bookkeeping). When designing our cloud clusters, we knew we wanted:

· Long-lived but ephemeral clusters (Problems? Blow the cluster away and start over)

· All important data on GCS

· Separate clusters per application team

· Fast autoscaling

· Jobs launched from GKE

So, we ended up with something like this:

That's a lot of lines, but let me break down the changes:

· Separate clusters run in separate subnets per application team

· Jobs are launched from GKE, not from VMs. Each pod contains only one application — no more manual bin-packing of applications onto VMs!

· HDFS still exists, but only vestigially — YARN uses HDFS for Job Conf storage, the Distributed Cache and application logs. But all application data is stored on GCS

· Since HDFS barely does any work, we only need a couple of DataNodes. Most worker nodes are NodeManager-only, and they can quickly scale up and down with application load (a rough sketch of the ephemeral-cluster idea follows below)
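The clusters described here are custom-built and fed by GKE, but if you just want to try the ephemeral, job-scoped pattern on GCP, the managed route is Cloud Dataproc. Below is a minimal sketch using the google-cloud-dataproc client library; the project, region, cluster name and machine sizes are made-up placeholders, and this is one possible approach rather than what this team actually runs:

```python
# Sketch: create a small, job-scoped Dataproc cluster and delete it when done.
# Project, region, cluster name and machine types are hypothetical.
from google.cloud import dataproc_v1

PROJECT = "example-project"
REGION = "us-central1"
CLUSTER_NAME = "ephemeral-etl-cluster"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": CLUSTER_NAME,
    "config": {
        # Keep the cluster as small as the job allows.
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# Create the cluster and wait for it to be ready.
cluster_client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
).result()

# ... submit and wait for your job here ...

# Tear the cluster down as soon as the job is done.
cluster_client.delete_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER_NAME}
).result()
```

Deleting the cluster when the job finishes is what makes the "all important data on GCS" rule above pay off: nothing you care about lives on the cluster's local disks.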
