Azure/GCP/AWS/Terraform/Spark

Build a Hybrid Multi-Cloud Data Lake and Perform Data Processing Using Apache Spark

Create a Multi-Cloud Data Lake using Terraform and run a configuration-driven Apache Spark data pipeline on COVID-19 data

Kapil Sreedharan
Aug 28, 2020 · 9 min read

Create a Multi-Cloud Data Platform and run a Spark Processing job on it

Prerequisite:

export PROJECT_NAME=${USER}-dataflow
export TF_ADMIN=${USER}-TFADMIN
export TF_CREDS=~/.config/gcloud/${TF_ADMIN}-terraform-admin.json
export PROJECT_BILLING_ACCOUNT=YOUR_BILLING_ACCOUNT
## To get YOUR_BILLING_ACCOUNT run
gcloud beta billing accounts list
gcloud projects create ${PROJECT_NAME} --set-as-default
gcloud config set project ${PROJECT_NAME}
gcloud beta billing projects link $PROJECT_NAME \
--billing-account ${PROJECT_BILLING_ACCOUNT}
gcloud iam service-accounts create terraform \
--display-name "Terraform admin account"
gcloud iam service-accounts keys create ${TF_CREDS} \
--iam-account terraform@${PROJECT_NAME}.iam.gserviceaccount.com \
--user-output-enabled false
gcloud projects add-iam-policy-binding ${PROJECT_NAME} \
--member serviceAccount:terraform@${PROJECT_NAME}.iam.gserviceaccount.com \
--role roles/viewer
gcloud projects add-iam-policy-binding ${PROJECT_NAME} \
--member serviceAccount:terraform@${PROJECT_NAME}.iam.gserviceaccount.com \
--role roles/storage.admin

gcloud projects add-iam-policy-binding $PROJECT_NAME \
--member serviceAccount:terraform@${PROJECT_NAME}.iam.gserviceaccount.com \
--role roles/bigquery.dataOwner \
--user-output-enabled false
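
The three role bindings above differ only in the role name, so they can be sketched as a loop. This is an illustrative variant, not the author's script: `bind_roles` and the `RUN` switch are hypothetical names, and by default the sketch only prints the gcloud commands instead of executing them.

```shell
PROJECT_NAME="${PROJECT_NAME:-demo-dataflow}"   # fallback only for this sketch
RUN="${RUN-echo}"   # hypothetical switch: prints by default; export RUN="" to execute

# Bind each role passed as an argument to the terraform service account
bind_roles() {
  for role in "$@"; do
    $RUN gcloud projects add-iam-policy-binding "${PROJECT_NAME}" \
      --member "serviceAccount:terraform@${PROJECT_NAME}.iam.gserviceaccount.com" \
      --role "${role}"
  done
}

bind_roles roles/viewer roles/storage.admin roles/bigquery.dataOwner
```

Printing first makes it easy to review the exact commands before granting anything.
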
gcloud services enable dataproc.googleapis.com
gcloud services enable bigquery.googleapis.com
cd ~
git clone https://github.com/ksree/dataflow-iac.git
cd ~/dataflow-iac/dataproc
terraform init
terraform apply -auto-approve \
-var="project_name=$PROJECT_NAME" \
-var="bucket_name=${PROJECT_NAME}_file_output_store"
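
The bucket name is derived from `${USER}`, so it may be worth checking it before Terraform tries to create the bucket. The sketch below covers only the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; starting and ending with a letter or digit). `is_valid_bucket_name` is a hypothetical helper, not part of the dataflow-iac repository.

```shell
# Hypothetical helper: succeed only if $1 passes the basic bucket naming rules
is_valid_bucket_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Fallback project name is only for this sketch
bucket_name="${PROJECT_NAME:-demo-dataflow}_file_output_store"

if is_valid_bucket_name "$bucket_name"; then
  echo "ok: $bucket_name"
else
  echo "invalid bucket name: $bucket_name" >&2
fi
```

A user name with uppercase letters, for example, would fail this check.
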
export ARM_CLIENT_ID=""   #Fill in your client id
export ARM_CLIENT_SECRET="" #Fill in your client secret
export ARM_TENANT_ID="" #Fill in your tenant id
export ARM_SUBSCRIPTION_ID="" #Fill in your subscription id
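
The four `ARM_*` variables above are exported as empty strings until you fill them in. A small guard, sketched here with a hypothetical `require_vars` helper, can catch a forgotten value before `terraform apply` runs.

```shell
# Hypothetical helper: return non-zero if any named variable is unset or empty
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if require_vars ARM_CLIENT_ID ARM_CLIENT_SECRET ARM_TENANT_ID ARM_SUBSCRIPTION_ID; then
  echo "Azure credentials are set"
else
  echo "Fill in the ARM_* variables before running terraform" >&2
fi
```

The same helper could be pointed at the AWS variables by passing their names instead.
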
cd ~/dataflow-iac/azure
export ip4=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)
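
To see what the `ip4` pipeline above extracts, here is the same `awk` and `cut` chain applied to a sample line. The sample address is made up, and `extract_ipv4` is a hypothetical wrapper name. Note that the original command assumes the interface is called `eth0`; on a machine where it has another name, `ip4` ends up empty.

```shell
# Sample line in the format printed by `ip -o -4 addr list eth0` (address is made up)
sample='2: eth0    inet 10.0.0.4/24 brd 10.0.0.255 scope global eth0'

# Take the 4th field (address/prefix), then drop the prefix length after the slash
extract_ipv4() {
  awk '{print $4}' | cut -d/ -f1
}

printf '%s\n' "$sample" | extract_ipv4   # prints 10.0.0.4
```
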
terraform init
terraform plan
terraform apply -auto-approve
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
cd ~/dataflow-iac/aws
terraform init
terraform apply -auto-approve
sudo apt-get install -y openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
cd ~
git clone https://github.com/ksree/dataflow.git
cd ~/dataflow/
#Update the storage bucket name in the covid job config file
sed -i -e "s/<BUCKET_NAME>/${PROJECT_NAME}_file_output_store/g" ~/dataflow/src/main/resources/config/covid_tracking.yaml
mvn clean install -DskipTests
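
To illustrate what the `sed` step does to the job config, the sketch below runs the same substitution on a throwaway file. The file content and the `output_bucket` key are made up for the demo; the real `covid_tracking.yaml` may look different.

```shell
PROJECT_NAME="${PROJECT_NAME:-demo-dataflow}"   # fallback only for this demo

# Hypothetical stand-in for the real config file
demo_config="$(mktemp)"
printf 'output_bucket: <BUCKET_NAME>\n' > "$demo_config"

# Same substitution as the sed command above, applied to the demo file
sed -i -e "s/<BUCKET_NAME>/${PROJECT_NAME}_file_output_store/g" "$demo_config"

cat "$demo_config"
```

After the substitution, the placeholder is replaced by the bucket name that Terraform created earlier, so the Spark job writes to the right bucket.
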
#Terminate AWS resources
cd ~/dataflow-iac/aws/
terraform destroy -auto-approve
#Terminate Azure resources
cd ~/dataflow-iac/azure/
terraform destroy -auto-approve
#Terminate GCP resources
cd ~/dataflow-iac/gcp/
terraform destroy -auto-approve \
-var="project_name=$PROJECT_NAME" \
-var="bucket_name=${PROJECT_NAME}_file_output_store"

Recap:

Kapil Sreedharan

Written by

Big Data Consultant | Learn | Build | Share https://github.com/ksree
