Disclaimer: I worked for Hortonworks (now part of Cloudera), a Hadoop vendor, and in my technical pre-sales role I worked with many customers, selling Hadoop and helping them compare it against EMR and Dataproc.
Let’s talk about the cloud giants’ managed Hadoop offerings: EMR from AWS, Dataproc from Google, and HDInsight from Microsoft Azure.
Dataproc — https://cloud.google.com/dataproc/
Why move to a cloud-based Hadoop solution?
To reduce CapEx and OpEx. There is no need to keep servers up and running when there is no workload. Also, having a centralized data storage layer with the flexibility of spinning up a compute layer without moving the data around is a powerful idea if done right.
Imagine shutting down the on-prem Hadoop cluster during off-peak hours or idle time.
In this article, let’s take a look at GCP.
Migrating On-Premises Hadoop Infrastructure to Google Cloud Platform: guidance on moving on-premises Hadoop workloads to Google Cloud Platform
1 — Move data to cloud buckets.
Cloud Data Transfer Service | Fast Data Migration | Cloud Migration Products | Google Cloud
gsutil cp Iamme.csv gs://nstesting
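Before that cp can succeed, the target bucket has to exist. A minimal sketch of the full sequence, reusing the bucket name from the example above (your bucket name must be globally unique, and the region is an illustrative assumption):

```shell
# Create a regional bucket (name must be globally unique across GCP;
# "nstesting" matches the example above but yours will differ).
gsutil mb -l us-central1 gs://nstesting

# Copy a local file into the bucket, then list to verify it arrived.
gsutil cp Iamme.csv gs://nstesting/
gsutil ls gs://nstesting/

# For many files, -m enables parallel transfers and -r recurses.
gsutil -m cp -r ./local_data gs://nstesting/local_data
```

These commands require an authenticated gcloud SDK session and a project with billing enabled, so they are shown here as a sketch rather than something you can run as-is.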
Quickstart: Using the gsutil tool | Cloud Storage | Google Cloud
Have you heard of DistCp? It is Hadoop’s distributed copy tool: it runs a MapReduce job to copy large datasets in parallel between clusters, and it works with Cloud Storage as well.
Using DistCp to copy your data to Cloud Storage
Migrating HDFS Data from On-Premises to Google Cloud Platform | Migrating Hadoop to GCP | Google…
GCP doesn’t use the same fine-grained file permissions that you can achieve with HDFS on-premises, so plan your access model around Cloud IAM and bucket-level permissions instead.
neerajsabharwal@gcp ~> gcloud dataproc clusters create nsab
neerajsabharwal@gcp ~> gcloud dataproc clusters list
NAME  WORKER_COUNT  PREEMPTIBLE_WORKER_COUNT  STATUS   ZONE
nsab  2                                       RUNNING  us-central1-a
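The one-liner above accepts Dataproc’s defaults. A more explicit sketch is below; the machine types and worker count are illustrative assumptions, and flags and defaults change across gcloud SDK versions, so check `gcloud dataproc clusters create --help` for your version:

```shell
# Create a Dataproc cluster with explicit sizing instead of defaults.
gcloud dataproc clusters create nsab \
    --region us-central1 \
    --zone us-central1-a \
    --num-workers 2 \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4

# Delete the cluster when you are done, so you stop paying for idle compute.
gcloud dataproc clusters delete nsab --region us-central1
```

The delete step is the whole point of the CapEx/OpEx argument earlier: with data sitting in GCS, the cluster itself is disposable.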
Let’s connect to the master node, create some test data, and copy that test data to GCS.
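A sketch of those two steps, assuming the cluster created above (Dataproc names the master node with an `-m` suffix); the file name, contents, and HDFS path are illustrative:

```shell
# SSH into the Dataproc master node (cluster name + "-m" suffix).
gcloud compute ssh nsab-m --zone us-central1-a

# On the master: create a small test file and load it into HDFS.
echo "1,alice" >  test.csv
echo "2,bob"   >> test.csv
hdfs dfs -mkdir -p /gcp_mig
hdfs dfs -put test.csv /gcp_mig/
hdfs dfs -ls /gcp_mig/
```

With the data in HDFS, the DistCp command below moves it to the GCS bucket.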
hadoop distcp hdfs://nsab-m.c.demons123.internal:8020/gcp_mig/ gs://gcp_mig_hadoop/
Let’s play around with Hive.
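Dataproc ships with the GCS connector, so Hive can point a table directly at a `gs://` location. A minimal sketch, run on the master node; the table name and column layout are assumptions matching the two-column test data above:

```shell
# Create an external Hive table over the data DistCp placed in GCS,
# then query it. Dropping an external table leaves the GCS data intact.
hive -e "
CREATE EXTERNAL TABLE gcp_mig_test (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://gcp_mig_hadoop/gcp_mig/';

SELECT * FROM gcp_mig_test LIMIT 10;
"
```

Using an EXTERNAL table is the key design choice here: the data stays in GCS as the system of record, and any cluster (or a future one) can recreate the table over it.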
In this article, we created a Dataproc cluster, copied data from HDFS to GCS, and created a Hive table on GCS.
Now, if you have an on-prem Hadoop cluster, you can use Cloud VPN to set up connectivity between your on-prem environment and your cloud setup.
There is a lot more work to do when it comes to putting together an end-to-end story. I will find time to explore further and write more articles with demos.