Azure/GCP/AWS/Terraform/Spark

Build a Hybrid Multi-Cloud Data Lake and Perform Data Processing Using Apache Spark

Create a Multi-Cloud Data Lake using Terraform and run a configuration-driven Apache Spark data pipeline on COVID-19 data

Kapil Sreedharan
The Startup

--

Five years back, when I started working on enterprise big data platforms, the prevalent data lake architecture was to go with a single public cloud provider or an on-prem platform. These data lakes quickly grew to several terabytes and then petabytes of structured and unstructured data (only about 1% of unstructured data is ever analyzed or used at all). On-prem data lakes hit capacity limits, while single-cloud implementations risked vendor lock-in.

Today, hybrid multi-cloud architectures that use two or more public cloud providers are the preferred strategy: 81% of public cloud users report using two or more cloud providers.

Cloud providers offer various tools and services to move data to the cloud and to transform it. In this article, we will create a cloud-native data pipeline using Apache Spark.

Create a Multi-Cloud Data Platform and run a Spark Processing job on it

The use case we are going to build here is:

  • Create a hybrid cloud infrastructure using Terraform
  • Read the European Centre for Disease Prevention and Control (ECDC) COVID-19 dataset, publicly available in Azure Open Datasets
  • Extract the US, Canadian, and German data from that dataset
  • Create and write the country-wise data into separate tables on Amazon Redshift, Azure SQL, and Google BigQuery
  • Use your favorite data analytics tools on AWS, Azure, or GCP to explore the mysteries of COVID-19

We will create a Spark ETL job on Google Cloud Dataproc that loads the ECDC COVID-19 data from Azure Blob Storage, transforms it, and then writes it to three different cloud stores: Amazon Redshift, Azure SQL, and Google BigQuery.

We automate cloud infrastructure provisioning using Infrastructure as Code (IaC) with Terraform. IaC lets you easily spin up and shut down clusters, so you only run a cluster while you are actually using it.

Prerequisites:

This use case can be built and run on AWS, Azure, and GCP resources that qualify for the free tier. Sign up for the Azure, GCP, and AWS free credits by following the links below.

  1. Sign up with GCP and get the $300 free trial

GCP Free Tier — Free Extended Trials and Always Free | Google Cloud (cloud.google.com)

Once you have signed up and logged in to the GCP console, activate Cloud Shell by clicking its icon in the console toolbar. Cloud Shell provides command-line access to a virtual machine instance, and this is where we are going to set up our use case. (Alternatively, you can do all of this from your own laptop's terminal.)

a. Create a new GCP project

Set up the environment variables

Run the following commands in the Cloud Shell terminal:

export PROJECT_NAME=${USER}-dataflow
export TF_ADMIN=${USER}-TFADMIN
export TF_CREDS=~/.config/gcloud/${TF_ADMIN}-terraform-admin.json
export PROJECT_BILLING_ACCOUNT=YOUR_BILLING_ACCOUNT
## To get YOUR_BILLING_ACCOUNT run:
gcloud beta billing accounts list

Create the project and link it to your billing account:

gcloud projects create ${PROJECT_NAME} --set-as-default
gcloud config set project ${PROJECT_NAME}
gcloud beta billing projects link $PROJECT_NAME \
--billing-account ${PROJECT_BILLING_ACCOUNT}

Create a terraform service account:

gcloud iam service-accounts create terraform \
--display-name "Terraform admin account"
gcloud iam service-accounts keys create ${TF_CREDS} \
--iam-account terraform@${PROJECT_NAME}.iam.gserviceaccount.com \
--user-output-enabled false

Grant the Terraform service account permission to view the project and to manage Cloud Storage and BigQuery:

gcloud projects add-iam-policy-binding ${PROJECT_NAME} \
--member serviceAccount:terraform@${PROJECT_NAME}.iam.gserviceaccount.com \
--role roles/viewer
gcloud projects add-iam-policy-binding ${PROJECT_NAME} \
--member serviceAccount:terraform@${PROJECT_NAME}.iam.gserviceaccount.com \
--role roles/storage.admin

gcloud projects add-iam-policy-binding $PROJECT_NAME \
--member serviceAccount:terraform@${PROJECT_NAME}.iam.gserviceaccount.com \
--role roles/bigquery.dataOwner \
--user-output-enabled false

Enable the Dataproc and BigQuery API services:

gcloud services enable dataproc.googleapis.com
gcloud services enable bigquery-json.googleapis.com

b. Create the Dataproc cluster and BigQuery data warehouse using Terraform:

Install Terraform in the Cloud Shell terminal (a Debian VM).
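If Terraform is not already available in your Cloud Shell, a minimal way to install it is to grab a release binary from HashiCorp. The version number below is only an example; pick a current release from https://releases.hashicorp.com/terraform/ :

# Download a Terraform release, unzip it, and put the binary on the PATH
wget https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip
unzip terraform_1.5.7_linux_amd64.zip
sudo mv terraform /usr/local/bin/
terraform version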

Clone the project below to get the Terraform scripts that create Dataproc and BigQuery:

cd ~
git clone https://github.com/ksree/dataflow-iac.git

Run Terraform to create the Dataproc cluster, BigQuery data warehouse, and GCS bucket:

cd ~/dataflow-iac/dataproc
terraform init
terraform apply -auto-approve \
-var="project_name=$PROJECT_NAME" \
-var="bucket_name=${PROJECT_NAME}_file_output_store"

2. Sign up with Azure for free

Create your Azure free account today | Microsoft Azure (azure.microsoft.com)

Terraform supports a number of different methods for authenticating to Azure.

We will use the Service Principal authentication method, because it will come in handy when we want to automate this whole process (think CI/CD).

Register Terraform with Azure AD and create a service principal:

a. Create Application and Service Principal:

Navigate to the Azure Active Directory overview within the Azure Portal — then select the App Registrations blade. Click the New registration button at the top to add a new Application within Azure Active Directory. On this page, set the following values then press Create:

  • Name — DataflowTerraformApp
  • Supported Account Types — this should be set to “Accounts in this organizational directory only (single-tenant)”
  • Redirect URI — choose “Web” for the URI type; the actual value can be left blank

b. Generating a Client Secret for the Azure Active Directory Application

Now that the Azure Active Directory application exists, we can create a client secret for authentication. To do this, select Certificates & secrets. This screen displays the certificates and client secrets (i.e. passwords) associated with this Azure Active Directory application.

Click the “New client secret” button, then enter a short description, choose an expiry period, and click “Add”. Once the client secret has been generated it will be displayed on screen. The secret is only shown once, so be sure to copy it now (otherwise you will need to generate a new one). This is the client_secret you will need.

export ARM_CLIENT_ID="d9644b52-ae20-410f-bf70-32d0e324084d"
export ARM_CLIENT_SECRET="XXXXXXXXXXXXXXXX"
export ARM_TENANT_ID="fa6dd69f-ba26-418c-8dc5-525f88d2d963"

c. Assign a role to the application
To access resources in your subscription, you must assign a role to the application. Here we will assign a role at the subscription scope.

a). Search for and select Subscriptions or select Subscriptions on the Home page

b). On the subscriptions page, click on your free subscription

Store the subscription ID as an environment variable, as below:

For me the value is:

export ARM_SUBSCRIPTION_ID="728ed9d5-a71a-4f39-b25b-a293943d7a06"

c). Select Access control (IAM).

d). Select Add role assignment.

For the role, select Owner (the Contributor role will not work).

Select DataflowTerraformApp

e). Select Save to finish assigning the role. You will see your application in the list of users with a role for that scope.

Your service principal is set up.

In the Google Cloud Shell terminal, run the commands below to set up the environment variables for Azure authentication:

export ARM_CLIENT_ID=""   #Fill in your client id
export ARM_CLIENT_SECRET="" #Fill in your client secret
export ARM_TENANT_ID="" #Fill in your tenant id
export ARM_SUBSCRIPTION_ID="" #Fill in your subscription id
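Before running Terraform against Azure, it can be worth confirming that the service principal credentials actually work. This assumes you have the Azure CLI installed somewhere (it is not part of the Cloud Shell image by default), and it is purely an optional check:

# Log in as the service principal to confirm the credentials are valid
az login --service-principal \
  --username "$ARM_CLIENT_ID" \
  --password "$ARM_CLIENT_SECRET" \
  --tenant "$ARM_TENANT_ID"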

In the Google Cloud Shell terminal, run the Terraform commands below to create the Azure SQL database:

cd ~/dataflow-iac/azure
export ip4=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)
terraform init
terraform plan
terraform apply -auto-approve

This will create your Azure SQL database, and you will see output like the below.

Verify Azure SQL installation:

Search for and select Azure SQL, or select Azure SQL on the Home page

On the Azure SQL Home page, you will see the newly created SQL server and database.
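If you have the Azure CLI available, the same check can be done from a terminal; the server name is generated by the Terraform scripts, so simply list everything in the subscription:

# List the SQL servers created in the subscription
az sql server list -o table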

3. Sign up for the AWS Free Tier

Follow the AWS documentation to create a user with programmatic access and administrator permissions by attaching the AdministratorAccess policy directly.

Create a user DataflowTerraformApp, and attach the existing policies directly:

When you create the user, you will receive an access key ID and a secret access key. Create the environment variables below and run them in the Google Cloud Shell terminal:

export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
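If the AWS CLI is installed (it is not guaranteed to be present in Cloud Shell), a quick way to confirm the keys are picked up correctly is:

# Confirm the access keys resolve to the DataflowTerraformApp user
aws sts get-caller-identity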

Create Amazon Redshift cluster and database:

cd ~/dataflow-iac/aws
terraform init
terraform apply -auto-approve

Verify the Redshift cluster on your AWS console:
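Alternatively, if you have the AWS CLI, you can check the cluster state without opening the console. The region below is an assumption; adjust it to match whatever the Terraform scripts use:

# Show Redshift cluster identifiers and status
aws redshift describe-clusters --region us-east-1 \
  --query 'Clusters[].{Id:ClusterIdentifier,Status:ClusterStatus}'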

Finally, we are done with the infrastructure setup. We have created the required resources on GCP, AWS, and Azure.

The next step is to set up the Spark Job.

Download and build the Spark ETL framework.

https://github.com/ksree/dataflow.git

Execute the following command on your cloud shell terminal:

sudo apt-get install -y openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
cd ~
git clone https://github.com/ksree/dataflow.git
cd ~/dataflow/
#Update the storage bucket name in the covid job config file
sed -i -e 's/<BUCKET_NAME>/'${PROJECT_NAME}'_file_output_store/g' ~/dataflow/src/main/resources/config/covid_tracking.yaml
mvn clean install -DskipTests

Output:

Run the Spark job

Submit the job to Dataproc from Cloud Shell.
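The exact main class, cluster name, and job arguments come from the dataflow project and the Terraform scripts, so treat the following only as a sketch of the general shape of a Dataproc Spark submission; every angle-bracket value is a placeholder, not a value taken from the repository:

# Sketch only: substitute the real cluster name, region, main class, jar path,
# and job arguments from the dataflow project before running
gcloud dataproc jobs submit spark \
  --cluster=<YOUR_DATAPROC_CLUSTER> \
  --region=<YOUR_REGION> \
  --class=<MAIN_CLASS_FROM_DATAFLOW_PROJECT> \
  --jars=<PATH_TO_BUILT_DATAFLOW_JAR> \
  -- <JOB_ARGS_SUCH_AS_THE_COVID_CONFIG_FILE>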

Verify the output:

1. Azure SQL Server

a. On the Azure SQL page, click on the database dataflowazuresqldatabase

b. On the SQL Database page, click on Query editor to open it.

You will see the two newly created tables: dbo.casesInCanada and dbo.casesInUs

2. AWS Redshift:

Log in to the Amazon console and open the Redshift query editor to view the two newly generated tables.

3. GCP BigQuery

Log in to the GCP BigQuery console to view your newly generated tables.
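You can also check from Cloud Shell with the bq command-line tool; the dataset name is whatever the Terraform scripts created, shown here as a placeholder:

# List datasets in the project, then list the tables inside the Covid dataset
bq ls --project_id=${PROJECT_NAME}
bq ls <YOUR_DATASET_NAME>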

Awesome!!!

Now that we are done, it's time to terminate all the cloud resources that we created.

#Terminate AWS resources
cd ~/dataflow-iac/aws/
terraform destroy -auto-approve
#Terminate Azure resources
cd ~/dataflow-iac/azure/
terraform destroy -auto-approve
#Terminate GCP resources
cd ~/dataflow-iac/dataproc/
terraform destroy -auto-approve \
-var="project_name=$PROJECT_NAME" \
-var="bucket_name=${PROJECT_NAME}_file_output_store"

Recap:

Here we created a hybrid cloud infrastructure and used Apache Spark to read, process, and write the COVID-19 dataset into three different cloud stores.

Infrastructure Provisioning: Infrastructure as Code (IaC) to provision and manage multi-cloud infrastructure using Terraform

Data Processor: Apache Spark running on Google Dataproc

Data Source: ECDC COVID-19 dataset on Azure Blob Storage

Data Sink/Destination: We wrote to three different cloud stores:

  1. GCP BigQuery
  2. Amazon Redshift
  3. Azure SQL
