Migrating Workloads to GCP: Part III: Reference Use Cases of DW and Data Lake Migration

Pooja Kelgaonkar
Google Cloud - Community
8 min read · Jul 5, 2022

Thanks for reading the first blog in my GCP migration series. If you missed it, you can read it here: https://poojakelgaonkar.medium.com/migrating-workloads-to-gcp-part-i-planning-migrations-ae8e301d207b

You can read the second blog here: https://poojakelgaonkar.medium.com/migrating-workloads-to-gcp-part-ii-phases-of-migration-e9167a039316

This blog is the third, and second-to-last, chapter of the migration series. In it we discuss some reference architectures and use cases for GCP migration. In the earlier blogs, we covered assessing the existing system, defining the goal, and choosing the type of migration applicable to the current project/application. We also walked through the phases of a GCP migration, including the design phase.

Let’s consider two use cases for the reference architecture discussion:

· Teradata DW migration to Google BigQuery

· Bigdata Data Lake Migration to Google Cloud Platform

What is common in evaluating these two migrations? While discussing these use cases, let’s review the migration phases and their outcomes.

  1. Planning Migrations –

a. Defining the current state of the application –

i. Use case 1: Teradata migration

1. Define the current state of the system — How is the EDW set up? What is the current volume/size of the EDW system? What are the source integrations and the target integrations? How many users access the TD system? How many processes/applications run on TD? Which are full loads and which are delta/incremental loads? How is source data brought into the system? Which tools/utilities are used?

Let’s consider -

Sources -> Flat files, semi-structured files, other RDBMS sources

- TD utilities are used to bring flat files into the TD staging layer

- An ETL tool is used to get data from the other RDBMS systems as well as the semi-structured files

Target -> Downstream application feeds, Reporting & BI, AI/ML applications

- Tableau is used for the Reporting & BI application

- Feeds are generated and sent as files to an FTP server for the downstream applications

- The AI/ML team accesses data from a DB layer

Size -> consider > 50 TB of active data and > 20 TB of archival data

Volume growth -> consider 5% data growth per month

2. Transformation layers -> How is data staged? What is the retention policy? How is data archived after retention? What is the source file retention? What layers are defined?

Let’s consider –

Source files are retained for 7 days and then archived to an archive folder

Layers — staging, raw data, and summary data layers

SCD implementation — SCD Type II tables, target tables

Data retention — active data is maintained for 18 months and then archived to archival databases

Data model, table/view designs — star/snowflake schema or FSLDM?

Audit batch control implementation

Pipelines — which tools and technologies are used? TD native utilities such as TPT and BTEQ for the transformation layers; Informatica jobs are used for some of the complex transformations

Error handling and restartability — how are alerts configured? When a job has to be re-run, which steps need to be performed to re-run it?

Consider that Control-M is used as the enterprise scheduler to schedule and invoke the jobs

3. Consider the sample reference diagram below for the EDW implementation — the current architecture

ii. Use case 2: Hadoop Data Lake migration

1. Define the current state — similar to the EDW, let’s list the current-state pointers for the Data Lake system

- Source integrations — file feeds — how are they FTP’ed to HDFS?

- RDBMS integrations — Sqoop or other tools/technologies used to bring RDBMS data into the Data Lake system

- Target integrations — how are feeds/data accessed from the Data Lake system?

- Who are the individual users and application users of the existing Data Lake system?

- Is there any BI/analytics running on top of the Data Lake?

- Size of the Data Lake — active storage vs. archive storage

- Pipeline setup — how are the pipelines running on the Data Lake? Which scheduler/orchestrator is used to set up the jobs?

- Which tools/technologies are used to set up the pipelines — let’s consider that all jobs are Spark jobs, run through Oozie workflows, and scheduled via Control-M

• Let’s consider the following as a typical Data Lake implementation architecture

b. Designing the future state — Let’s take the above two use cases and discuss the following possible solutions for their GCP migrations –

i. EDW/TD Migration -> considerations:

1. BigQuery as the DW on Google Cloud. Use the BigQuery Data Transfer Service to bring one-time or incremental data from the existing TD system into BQ

2. GCS as the landing/staging layer for source files. Put a script in place that uses gsutil to push files to GCS once they arrive on the shared location (a minimal upload sketch follows this list)

3. ETL jobs can either be replaced with BQ jobs, or the existing ETL tool can continue to be used to load data into BQ

4. TD native jobs for SQL transformations — convert them to run on BQ as per BQ SQL standards

5. Use Airflow to orchestrate the jobs/pipelines (see the sample DAG after this list)

6. Keep the Control-M scheduler as-is to invoke the Airflow jobs per the existing schedule (this won’t impact the downstream schedule, and dependencies hold true)

7. Create service accounts corresponding to the application IDs needed to run processes in the existing TD system

8. Create individual IDs for the users of the TD system and set up their IAM roles to match the existing TD system roles

9. Downstream integrations for BI tools/reporting continue as-is, with the target connections switched and the SQL tweaked to bring in the reports

10. Downstream feeds — use BQ native utilities to generate the downstream feed/file to be shared with the FTP server for downstream applications. Integrate the FTP connection and push to the server to avoid any impact on the downstream applications

11. Error handling & alerting — consider implementing error handling and alerting through the Airflow jobs

12. Audit batch control — BQ is not well suited to frequent row-level operations, so for audit batch control prefer using Cloud SQL to set up the audit tables and manage the audit scripts that insert/update the job status based on the pipeline status

13. Logging — enable Cloud Logging (formerly Stackdriver) in addition to the regular job logging, and build dashboards using GCP native services
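For point 2, here is a minimal landing-script sketch that pushes newly arrived files to GCS with gsutil. The directory paths and bucket name are placeholders I am assuming for illustration; a production version would add retries, logging, and file-pattern configuration.

```python
#!/usr/bin/env python3
"""Push newly arrived source files from a shared location to a GCS landing bucket.

A minimal sketch -- the paths and bucket below are placeholders.
"""
import shutil
import subprocess
from pathlib import Path

LANDING_DIR = Path("/data/shared/landing")           # hypothetical shared location
ARCHIVE_DIR = Path("/data/shared/archive")           # local archive after upload
GCS_LANDING = "gs://my-edw-landing-bucket/staging/"  # placeholder bucket


def push_to_gcs() -> None:
    for source_file in LANDING_DIR.glob("*.csv"):
        # gsutil cp copies the file into the GCS staging prefix
        subprocess.run(
            ["gsutil", "cp", str(source_file), GCS_LANDING],
            check=True,
        )
        # keep a local copy, matching the 7-day retention described earlier
        shutil.move(str(source_file), str(ARCHIVE_DIR / source_file.name))


if __name__ == "__main__":
    push_to_gcs()
```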
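For points 5, 10, and 11, a minimal Airflow DAG sketch: it runs a converted TD transformation as a BigQuery job, then exports a downstream feed with EXPORT DATA, and raises an alert through a failure callback. The project, dataset, table, and bucket names are placeholders I am assuming, and the alerting function is a stub you would wire to email, Pub/Sub, or your incident tool.

```python
"""A minimal Airflow DAG sketch: BQ transformation + downstream feed export.

Project, dataset, table, and bucket names are placeholders.
"""
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


def notify_failure(context):
    # Stub alert hook: plug in email, Pub/Sub, or an incident tool here.
    print(f"Task {context['task_instance'].task_id} failed for {context['ds']}")


TRANSFORM_SQL = """
CREATE OR REPLACE TABLE `my-project.summary.daily_sales` AS
SELECT sale_date, store_id, SUM(amount) AS total_amount
FROM `my-project.raw.sales`
GROUP BY sale_date, store_id
"""

EXPORT_SQL = """
EXPORT DATA OPTIONS (
  uri = 'gs://my-feed-bucket/daily_sales/*.csv',
  format = 'CSV', overwrite = true, header = true
) AS
SELECT * FROM `my-project.summary.daily_sales`
"""

with DAG(
    dag_id="edw_daily_load",
    start_date=datetime(2022, 7, 1),
    schedule_interval=None,  # Control-M keeps triggering the run, as-is
    catchup=False,
    default_args={"on_failure_callback": notify_failure},
) as dag:
    transform = BigQueryInsertJobOperator(
        task_id="run_summary_transform",
        configuration={"query": {"query": TRANSFORM_SQL, "useLegacySql": False}},
    )
    export_feed = BigQueryInsertJobOperator(
        task_id="export_downstream_feed",
        configuration={"query": {"query": EXPORT_SQL, "useLegacySql": False}},
    )
    transform >> export_feed
```

The exported files would then be picked up and pushed to the FTP server as described in point 10.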

ii. Data Lake migration — let’s consider the native services and utilities below to be used in the target system design

1. Use GCS buckets as the storage service to store source, raw, staging, intermediate, and target files

2. Use Dataproc clusters as the Hadoop ecosystem to run Hadoop-native jobs on GCP

3. Use GCS as the HDFS/storage layer while setting up the Data Lake on GCP (a job sketch follows this list)

4. Use Airflow to orchestrate the jobs on GCP

5. Source integrations — for file feeds, write a small utility to push files from the source server, shared drive, or NAS to GCP. Use gsutil or DistCp — push or pull based on file size and cluster utilization

6. Source integrations — for RDBMS feeds, continue ingesting them through Sqoop, or choose one of the GCP-native services such as Datastream, Dataflow, or Data Fusion based on the source system, data size, and ingestion frequency

7. Source integrations — for real-time feeds, continue using native Kafka to push data to GCS, or integrate it with Pub/Sub to bring real-time or near-real-time data into GCP

8. For downstream analytics or BI applications — consider exposing and exploring the data through BQ

9. For job/pipeline processing — consider ephemeral clusters in place of static clusters to avoid over-billing and implement a cost-optimized solution (see the ephemeral-cluster DAG after this list)
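For point 3, a minimal sketch of an existing Spark job after migration: the only structural change is that it reads and writes gs:// paths instead of hdfs:// paths, since Dataproc ships with the GCS connector. The bucket names and column names are placeholders I am assuming.

```python
"""transform.py -- an existing Spark job, with gs:// paths replacing hdfs:// paths.

Bucket and column names are placeholders.
"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-transform").getOrCreate()

# GCS acts as the storage layer; only the path scheme changes from hdfs:// to gs://
raw = spark.read.parquet("gs://my-datalake-raw/sales/")
summary = raw.groupBy("sale_date", "store_id").sum("amount")
summary.write.mode("overwrite").parquet("gs://my-datalake-curated/daily_sales/")

spark.stop()
```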
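And for point 9, a minimal sketch of the ephemeral-cluster pattern in Airflow: create a Dataproc cluster, submit the migrated PySpark job from GCS, and delete the cluster whether or not the job succeeds. The project, region, cluster sizing, and file URIs are assumptions for illustration only.

```python
"""Ephemeral Dataproc sketch: create cluster -> run PySpark job -> delete cluster.

Project, region, bucket, and file URIs below are placeholders.
"""
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "my-project"
REGION = "us-central1"
CLUSTER_NAME = "ephemeral-etl-cluster"

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
}

PYSPARK_JOB = {
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://my-code-bucket/jobs/transform.py"},
}

with DAG(
    dag_id="datalake_spark_pipeline",
    start_date=datetime(2022, 7, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )
    run_job = DataprocSubmitJobOperator(
        task_id="run_spark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,  # tear down even if the job fails
    )
    create_cluster >> run_job >> delete_cluster
```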

Target system design — consider the diagram below as a reference architecture for the DW/Data Lake system setup on GCP. We can leverage GCP native services and open source to design and migrate the existing platform to GCP.

2. Implementation or Migration to GCP

a. Once we have defined the current state and designed the target system, we can start the implementation cycle

b. Avoid a big-bang approach

c. Based on the planning phase and the current-system assessment, define the dependencies between processes and identify the applications/processes to be migrated in phases

d. Let’s consider that we have 4 different applications running in the existing TD system and we want to migrate them in phases. We observe that application 1 brings in the source data, application 3 performs the transformations, application 2 runs summarization processes to consolidate the data to be shared with downstream applications, and application 4 generates the feeds and pushes them downstream

e. Given these dependencies, we can plan to move application 1 first to get all the required source data into the system, plan application 3 next, followed by applications 2 and 4

f. A phased approach is always the best approach to migrate and test. Once one phase is tested and validated, it is easy to integrate the other applications

Validation is one of the most important keys to migration.

a. Validate every single byte transferred and every single piece of code converted to run on GCP; validate the source as well as the target objects, along with the key intermediate objects of the DW system

b. Define validation cycles before signing off any piece of the migration; run 3–10 cycles to ensure all scenarios and a variety of data are covered, tested, and validated (a sample validation check follows)
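As one example of a check you might run in each cycle, here is a minimal sketch that compares row counts and a numeric checksum, assuming the corresponding Teradata extract has been staged in a separate BigQuery dataset for comparison. The project, dataset, table, and column names are placeholders.

```python
"""One validation check per cycle: compare row counts and a numeric checksum.

Assumes the TD extract was staged in BigQuery; all names are placeholders.
"""
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

VALIDATION_SQL = """
SELECT
  (SELECT COUNT(*)    FROM `my-project.td_staging.sales_extract`) AS src_rows,
  (SELECT COUNT(*)    FROM `my-project.dw.sales`)                 AS tgt_rows,
  (SELECT SUM(amount) FROM `my-project.td_staging.sales_extract`) AS src_sum,
  (SELECT SUM(amount) FROM `my-project.dw.sales`)                 AS tgt_sum
"""

row = next(iter(client.query(VALIDATION_SQL).result()))
assert row.src_rows == row.tgt_rows, f"row count mismatch: {row.src_rows} vs {row.tgt_rows}"
assert row.src_sum == row.tgt_sum, f"checksum mismatch: {row.src_sum} vs {row.tgt_sum}"
print("validation passed:", dict(row.items()))
```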

3. Signing off

a. Based on the validation cycles and their results, consider signing off each component migrated to GCP

b. Don’t shut down the primary system until the new system has been up and running in parallel for some time

c. Define the parallel execution period — 3 to 12 months — to validate system performance, integrations, etc.

d. Roll out the new system to a set of users and use cases and start validating them

e. Once it has been signed off by the various stakeholders and data users, the new system is ready to replace the old one

4. Cost Optimization

a. Once the cloud migration is done, most people tend to ignore the ongoing costs and utilization until they get a utilization warning alarm

b. Cost optimization is an ongoing activity: set up alerts and dashboards, monitor the new system on GCP, and keep watching system utilization, usage patterns, cost patterns, scaling policies, archival policies, storage policies, etc. (a sample billing-export query follows this list)

c. Once the applications are migrated, we will see savings over a period of time; we can’t compare Day 1 of the new system with Day 1000 of the old system.

d. We have to set up all the necessary alerts and dashboards and enroll in Google’s committed use discounts to gain more savings over time
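As one input to such a cost dashboard, here is a hedged sketch of a query against the standard billing export in BigQuery, showing monthly cost per service net of credits. The export table name is project-specific; the one below is a placeholder, as is the project ID.

```python
"""Monthly cost per GCP service from the billing export in BigQuery.

The billing export table name is project-specific; the one below is a placeholder.
"""
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

COST_SQL = """
SELECT
  FORMAT_DATE('%Y-%m', DATE(usage_start_time)) AS month,
  service.description                          AS service,
  ROUND(SUM(cost)
        + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) c), 0)), 2) AS net_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
GROUP BY month, service
ORDER BY month, net_cost DESC
"""

for row in client.query(COST_SQL).result():
    print(row.month, row.service, row.net_cost)
```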

In the next blog, I am going to share some of the common challenges with GCP migrations. That will be the last chapter in the migration series.

Please don’t forget to share or subscribe to the blog if you find this information useful/insightful. You can reach out to me if you have any questions on migrations or need any references.

About Me:

I am a DWBI and Cloud Architect! I have been working with various legacy data warehouses, big data implementations, and cloud platforms/migrations. I am a Google Certified Professional Cloud Architect. You can reach out to me on LinkedIn if you need any further help with certification or GCP implementations!
