Migration paths for Python-based applications on GCP

Murli Krishnan
Google Cloud - Community
4 min read · Oct 10, 2022

Data migration projects have two distinct pillars: the data itself and the applications that run on top of that data. These applications can be of varied nature: data pipelines, web applications, request/response APIs and micro-services.

In this blog, we will focus on the different migration patterns for Python-based jobs and where each should be executed.

In a typical lift-and-shift migration, the focus is on getting things running with as few code changes as possible, the reason being to reduce the time required for the migration.

Although this is a valid concern, since it affects the overall cost of the migration, it is important to understand that choosing the correct approach can bring cost savings and additional benefits from adopting cloud-native services.

The approach should be based on factors such as long-term cost savings, scalability, availability and ease of maintenance.

Migration Paths for Python applications

Micro-services pattern & request/response model

Generally these applications are responsible for performing business-specific functions such as generating recommendations on demand, producing insights or running a data-correction feedback loop.

These applications generally run in the form of a service.

The approach for these applications is to check whether they can be made container-compatible.

Google Kubernetes Engine (GKE) or Cloud Run can be leveraged for deploying such applications. For Cloud Run, refer to the container runtime contract to understand the prerequisites a Cloud Run service must meet.

GKE provides
1. More control over infrastructure and deployments
2. No timeout bounds
3. An Autopilot mode that reduces maintenance overhead
4. Advanced configuration options for scalability and resilience

Cloud Run can be leveraged when
1. Infrastructure maintenance overhead is not desired
2. Quick, serverless deployments are preferred
3. The advanced features of GKE are not needed
4. The service does not receive traffic 24/7
5. The application service is simple

It is possible to use a combination of both services to achieve the desired results.
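As an illustration, the sketch below shows what a container-ready Python service might look like, assuming Flask as the web framework; the service name, route and response are hypothetical. Cloud Run's container runtime contract expects the server to listen on the port supplied through the PORT environment variable.

```python
# app.py - a minimal, container-ready Python service (illustrative sketch).
# Cloud Run's container runtime contract expects the server to listen on the
# port provided in the PORT environment variable (default 8080).
import os

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/recommendations", methods=["POST"])
def recommendations():
    # Hypothetical business function: return recommendations for a user id.
    payload = request.get_json(silent=True) or {}
    user_id = payload.get("user_id", "unknown")
    return jsonify({"user_id": user_id, "items": ["item-1", "item-2"]})


if __name__ == "__main__":
    # Bind to 0.0.0.0 so the container accepts external traffic.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```

The same container image can then be deployed to either Cloud Run or GKE; in production the application would typically be served by a WSGI server such as gunicorn rather than the Flask development server.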

Operational Utilities

These are lightweight, reusable pieces of application code that perform operational tasks such as housekeeping, sending e-mail notifications, event-driven functions, report delivery and FTP transfers.

These applications can be deployed on Cloud Functions.

Cloud Functions (2nd generation) runs on Cloud Run underneath and provides
1. Integration with Eventarc triggers
2. Larger instance sizes
3. Longer timeout durations
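A minimal sketch of such a utility as a 2nd generation Cloud Function is shown below, assuming the Python functions-framework package; the function name and report logic are hypothetical.

```python
# main.py - illustrative sketch of a 2nd gen Cloud Function in Python.
# Assumes the functions-framework package; the entry point is "send_report".
import functions_framework


@functions_framework.http
def send_report(request):
    """Hypothetical operational utility: trigger a report delivery over HTTP."""
    name = request.args.get("report", "daily-summary")
    # ... fetch the report and e-mail / FTP it from here ...
    return f"Report '{name}' delivered", 200
```

The function can then be deployed with gcloud functions deploy using the --gen2 flag and an HTTP or Eventarc trigger.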

PySpark code

Spark Ecosystem

Spark — Data Integration and Transformation
Spark-based applications can be batch or micro-batch applications that leverage Spark Streaming capabilities.
The PySpark code can be migrated and executed on Dataproc (managed Hadoop clusters) or the Dataproc Serverless option with minimal code changes.
The orchestration of the code can be performed using Dataproc Templates or Cloud Composer DAGs, as in the sketch below.
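For example, a Composer (Airflow) DAG can submit an existing PySpark script to a Dataproc cluster largely unchanged. The sketch below assumes the apache-airflow-providers-google package; the project, bucket and cluster names are placeholders.

```python
# Illustrative Composer DAG that submits an existing PySpark job to Dataproc.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

PYSPARK_JOB = {
    "reference": {"project_id": "my-project"},
    "placement": {"cluster_name": "my-dataproc-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},
}

with DAG(
    dag_id="pyspark_on_dataproc",
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_pyspark = DataprocSubmitJobOperator(
        task_id="submit_pyspark",
        job=PYSPARK_JOB,
        region="us-central1",
        project_id="my-project",
    )
```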

Spark — SQL Analytics
Spark SQL scripts are leveraged for performing analytics on Hive tables.
Although Spark SQL code can easily be executed on Dataproc, it is worth considering moving the Spark SQL code to BigQuery SQL if the target data warehouse on GCP is BigQuery.
This comes with the added advantage of keeping the transformation logic in BigQuery and performing ELT instead of ETL.
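As an illustration, the sketch below re-expresses a simple Spark SQL aggregation as BigQuery SQL and runs it with the BigQuery Python client, so the transformation executes inside the warehouse (ELT); the dataset and table names are placeholders.

```python
# Illustrative sketch: run a transformation as BigQuery SQL (ELT) from Python.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    CREATE OR REPLACE TABLE analytics.daily_sales AS
    SELECT order_date, SUM(amount) AS total_amount
    FROM raw.orders
    GROUP BY order_date
"""

client.query(sql).result()  # waits for the transformation to finish
```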

Spark Machine Learning
Spark ML is used for implementing machine learning models.
Dataproc provides support for Spark-based machine learning libraries.

Jupyter Notebooks
There are also use-cases where Jupyter notebooks with Spark exploration code are used and need to be hosted on GCP.
Vertex AI provides managed notebooks that can be leveraged for hosting on-premises Jupyter notebooks.

Native Python-based functionalities

A percentage of the application code base is native Python code (pandas, numpy, sklearn) used typically for data transformation, feature engineering, data visualisation and Excel report generation, to name a few.

This code can be executed on an ad-hoc or batch basis.

These scenarios generally do not fit the use-case for GKE or Cloud Run.

There are three options which can be leveraged for deploying such code.

Option 1 — If Cloud Composer is part of the architecture and used for orchestration, the PythonOperator or PythonVirtualenvOperator can be leveraged for executing batch Python workloads.

This option comes with a downside
1. It puts load on Composer resources during program execution

Use this option only if the resource requirement of the code is small and does not impact Composer's own scheduling; a sketch follows below.
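The sketch below shows Option 1 using the PythonVirtualenvOperator, which keeps the task's dependencies isolated from the Composer environment; the DAG id, schedule and report logic are hypothetical.

```python
# Illustrative Composer DAG running a small native-Python task in an isolated
# virtualenv so its dependencies do not pollute the Composer environment.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def build_excel_report():
    # Hypothetical lightweight task using pandas; keep resource usage small so
    # it does not starve Composer's scheduling.
    import pandas as pd

    df = pd.DataFrame({"metric": ["rows_loaded"], "value": [42]})
    df.to_excel("/tmp/daily_report.xlsx", index=False)


with DAG(
    dag_id="native_python_report",
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    report = PythonVirtualenvOperator(
        task_id="build_excel_report",
        python_callable=build_excel_report,
        requirements=["pandas", "openpyxl"],
        system_site_packages=False,
    )
```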

Option 2 — Google Cloud Batch (recently launched) provides a mechanism for executing such ad-hoc/scheduled batch code.

The underlying instances for Batch can be created from an instance template with all dependencies installed, or the job can run as a container execution (see the sketch after the feature list below).

At the time of writing, the product is still in the preview stage.

Google Cloud Batch provides
1. A fully managed batch service
2. Auto-scaling capacity
3. No dedicated infrastructure to maintain
4. Spot instance usage
5. Container execution support / instance template support
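An illustrative sketch of submitting a containerised Python job with the google-cloud-batch client is shown below; the project, region, image, machine type and job id are placeholders.

```python
# Illustrative sketch: submit a containerised Python job to Google Cloud Batch.
from google.cloud import batch_v1


def submit_batch_job(project_id: str, region: str) -> batch_v1.Job:
    client = batch_v1.BatchServiceClient()

    # The runnable points at a container image holding the Python code and
    # its dependencies.
    runnable = batch_v1.Runnable()
    runnable.container = batch_v1.Runnable.Container(
        image_uri="gcr.io/my-project/feature-engineering:latest"
    )

    task = batch_v1.TaskSpec(runnables=[runnable], max_retry_count=1)
    group = batch_v1.TaskGroup(task_count=1, task_spec=task)

    # Run on Spot capacity to keep the cost of ad-hoc batch work down.
    policy = batch_v1.AllocationPolicy.InstancePolicy(
        machine_type="e2-standard-4",
        provisioning_model=batch_v1.AllocationPolicy.ProvisioningModel.SPOT,
    )
    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate(policy=policy)
    allocation = batch_v1.AllocationPolicy(instances=[instances])

    job = batch_v1.Job(task_groups=[group], allocation_policy=allocation)

    return client.create_job(
        parent=f"projects/{project_id}/locations/{region}",
        job_id="native-python-batch-job",
        job=job,
    )
```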

Option 3 — Jupyter notebooks on Vertex AI can be used for data exploration and visualisation code. These are generally developed by data analysts and data scientists performing exploration.
