Oracle to PostgreSQL: Migrate Oracle to PostgreSQL using Vertex AI notebooks and GCP Dataproc Serverless

Vaibhav Singhal
Google Cloud - Community
5 min read · May 16, 2023

Dataproc Templates, in conjunction with Vertex AI notebooks and Dataproc Serverless, provide a one-stop solution for migrating data directly from Oracle Database to any PostgreSQL database.

We have developed a Vertex AI notebook (Jupyter notebook) solution which uses Dataproc Serverless for the migration. You just need to provide the necessary parameters, and the notebook will help you migrate a complete Oracle data warehouse to a Cloud SQL for PostgreSQL instance.

This notebook solution is built on top of Vertex AI Jupyter notebooks and Google Cloud's Dataproc tech stack.

Before we begin, here are some added benefits of this notebook:

  1. Automatically generates the list of tables from metadata. Alternatively, the user can supply a list of tables.
  2. Identifies the primary key column and partitioned-read properties for each table.
  3. Automatically uses partitioned reads if a table's row count exceeds a threshold.
  4. Divides the migration into batches and migrates multiple tables in parallel.
  5. Allows you to choose the write mode, i.e. append or overwrite.
  6. Gives you the option to specify the PostgreSQL schema name.

Prerequisites:

  1. Enable the below services in the GCP project from the API Console:

  • Compute Engine API
  • Dataproc API
  • Vertex AI API
  • Vertex Notebooks API

2. Create a User-Managed Notebook in Vertex AI Workbench.

3. Clone the Dataproc Templates repository using the Git tab, or open a terminal from the launcher window and clone it with git clone:

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git

4. From the folders tab, open OracleToPostgreSQl_notebook.ipynb, located at dataproc-templates/notebooks/oracle2postgres.

5. The notebook includes a step-by-step procedure to assist with data migration from Oracle to PostgreSQL. Execute each step one at a time; the notebook contains detailed instructions. The migration steps are listed below:

Step 1: Install required packages

Some packages needed for the migration must be installed separately, as they are not available in the notebook environment by default, e.g. SQLAlchemy, psycopg2, cx_Oracle and the Oracle Instant Client.

!pip3 install SQLAlchemy
!pip3 install --upgrade google-cloud-pipeline-components kfp --user -q
!sudo apt-get install libpq-dev --yes
!pip3 install psycopg2
!pip3 install cx-Oracle

Oracle client Installation

%%bash
sudo mkdir -p /opt/oracle
sudo rm -fr /opt/oracle/instantclient*
cd /opt/oracle
sudo wget --no-verbose https://download.oracle.com/otn_software/linux/instantclient/instantclient-basic-linuxx64.zip
sudo unzip instantclient-basic-linuxx64.zip
INSTANT_CLIENT_DIR=$(find /opt/oracle -maxdepth 1 -type d -name "instantclient_[0-9]*_[0-9]*" | sort | tail -1)
test -n "${INSTANT_CLIENT_DIR}" || echo "ERROR: Could not find instant client"
test -n "${INSTANT_CLIENT_DIR}" || exit 1
sudo apt-get install --yes libaio1
sudo sh -c "echo ${INSTANT_CLIENT_DIR} > /etc/ld.so.conf.d/oracle-instantclient.conf"
sudo ldconfig
export LD_LIBRARY_PATH=${INSTANT_CLIENT_DIR}:$LD_LIBRARY_PATH

Step 2: Import Libraries: This notebook requires the Oracle JDBC driver and the PostgreSQL connector JAR. Installation information is present in the notebook.

import sqlalchemy
import google.cloud.aiplatform as aiplatform
from kfp import dsl
from kfp.v2 import compiler
from datetime import datetime
import time
import copy
import json
import math
import pandas as pd

from google_cloud_pipeline_components.experimental.dataproc import DataprocPySparkBatchOp

Step 3: Assign Parameters: The below configurations need to be set before running the notebook:

Step 3.1: Common Parameters

  • PROJECT : GCP project ID
  • REGION : GCP region
  • GCS_STAGING_LOCATION : GCS staging location used by this notebook to store artifacts
  • SUBNET : VPC subnet
  • JARS : list of JAR files; for this notebook, the Oracle JDBC driver and PostgreSQL connector JAR along with the Dataproc Templates JARs
  • MAX_PARALLELISM : number of jobs to run in parallel; the default value is 2
  • SERVICE_ACCOUNT : custom service account email to use for the Vertex AI pipeline and Dataproc jobs, with the permissions mentioned in the notebook
# Get GCP Project

PROJECT = "<project-id>"
REGION = "<region>"
GCS_STAGING_LOCATION = "<gs://bucket/[folder]>"
SUBNET = "<projects/{project}/regions/{region}/subnetworks/{subnet}>"
MAX_PARALLELISM = 5 # max number of tables which will be migrated in parallel
SERVICE_ACCOUNT = ""

# Do not change this parameter unless you want to refer to the below JARS from a different location
JARS = [GCS_STAGING_LOCATION + "/jars/ojdbc8-21.7.0.0.jar", GCS_STAGING_LOCATION + "/jars/postgresql-42.2.6.jar"]
# If SERVICE_ACCOUNT is not specified it will take the one attached to Notebook
if SERVICE_ACCOUNT == '':
    shell_output = !gcloud auth list 2>/dev/null
    SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()
print("Service Account: ", SERVICE_ACCOUNT)

Step 3.2: Oracle Parameters

  • ORACLE_HOST : Oracle instance IP address
  • ORACLE_PORT : Oracle instance port
  • ORACLE_USERNAME : Oracle username
  • ORACLE_PASSWORD : Oracle password
  • ORACLE_DATABASE : Name of database/service for Oracle connection
  • ORACLETABLE_LIST : list of tables you want to migrate, e.g. ['table1','table2']; provide an empty list [] to migrate the whole database
ORACLE_HOST = "<host>"
ORACLE_PORT = "<port"
ORACLE_USERNAME = "<username>"
ORACLE_PASSWORD = "<password>"
ORACLE_DATABASE = "<database>"
ORACLETABLE_LIST = [] # leave list empty to migrate the complete database, else provide tables as ['table1','table2']

Step 3.3: PostgreSQL Parameters

  • POSTGRES_HOST : Postgres instance IP address
  • POSTGRES_PORT : Postgres instance port
  • POSTGRES_USERNAME : Postgres username
  • POSTGRES_PASSWORD : Postgres password
  • POSTGRES_DATABASE : name of the database for the Postgres connection
  • OUTPUT_MODE : output write mode (one of: append, overwrite, ignore, errorifexists; defaults to overwrite)
  • BATCH_SIZE : JDBC output batch size. Default set to 1000
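For reference, a sample assignment block mirroring the Oracle block above; all values below are placeholders, not the notebook's exact defaults:

POSTGRES_HOST = "<host>"
POSTGRES_PORT = "<port>"
POSTGRES_USERNAME = "<username>"
POSTGRES_PASSWORD = "<password>"
POSTGRES_DATABASE = "<database>"
OUTPUT_MODE = "overwrite"  # one of: append, overwrite, ignore, errorifexists
BATCH_SIZE = 1000          # JDBC output batch size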

Step 4: Generate Oracle Table List

This step generates the list of tables for migration. All tables from the database are selected if the ORACLETABLE_LIST parameter in step 3.2 is an empty list.
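The sketch below illustrates the idea (not the notebook's exact code): when ORACLETABLE_LIST is empty, it queries the Oracle data dictionary for the connecting user's tables using SQLAlchemy with the cx_Oracle driver. The connection-string format is an assumption based on the step 3.2 parameters.

# Build a SQLAlchemy engine for the Oracle source (assumed service-name connection)
oracle_url = f"oracle+cx_oracle://{ORACLE_USERNAME}:{ORACLE_PASSWORD}@{ORACLE_HOST}:{ORACLE_PORT}/?service_name={ORACLE_DATABASE}"
oracle_engine = sqlalchemy.create_engine(oracle_url)

# If no explicit table list was provided, take every table owned by the user
if not ORACLETABLE_LIST:
    with oracle_engine.connect() as conn:
        rows = conn.execute(sqlalchemy.text("SELECT table_name FROM user_tables")).fetchall()
    ORACLETABLE_LIST = [row[0] for row in rows]
print(f"Tables to migrate: {ORACLETABLE_LIST}")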

Step 5: Get Primary Keys for tables from Oracle Source

This step gets the primary keys for the tables present in the ORACLETABLE_LIST parameter from ORACLE_DATABASE.
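A minimal sketch of how this can be done against the Oracle data dictionary, reusing the oracle_engine from the previous sketch; the notebook's actual query may differ:

# Look up the primary-key column for each table (constraint_type 'P')
pk_sql = sqlalchemy.text("""
    SELECT cols.table_name, cols.column_name
    FROM all_constraints cons
    JOIN all_cons_columns cols
      ON cons.constraint_name = cols.constraint_name
     AND cons.owner = cols.owner
    WHERE cons.constraint_type = 'P'
      AND cols.table_name = :tbl
""")
primary_keys = {}
with oracle_engine.connect() as conn:
    for table in ORACLETABLE_LIST:
        row = conn.execute(pk_sql, {"tbl": table}).fetchone()
        primary_keys[table] = row[1] if row else None  # None means no primary key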

Step 6: Get Row Count of Tables and identify Partition Columns

This step uses the PARTITION_THRESHOLD parameter (default value is 1 million); any table with more rows than PARTITION_THRESHOLD will be partitioned based on its primary key.
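The logic looks roughly like the sketch below (the notebook also collects the partition column bounds and number of partitions); it reuses oracle_engine and primary_keys from the earlier sketches:

# Flag tables whose row count exceeds the threshold and which have a primary key
PARTITION_THRESHOLD = 1_000_000
partitioned_tables = {}
with oracle_engine.connect() as conn:
    for table in ORACLETABLE_LIST:
        count = conn.execute(sqlalchemy.text(f"SELECT COUNT(*) FROM {table}")).scalar()
        partitioned_tables[table] = (count > PARTITION_THRESHOLD) and (primary_keys.get(table) is not None)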

Step 7: Download JAR files and Upload to GCS (only required to run once)

This step downloads the required JAR files and uploads them to the GCS_STAGING_LOCATION defined in step 3.1.
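A sketch of the idea using plain Python; the Maven Central URLs and the use of the google-cloud-storage client are assumptions, as the notebook has its own download/upload cell:

# Download the driver JARs and copy them under <GCS_STAGING_LOCATION>/jars/
import urllib.request
from google.cloud import storage

jar_urls = [
    "https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.7.0.0/ojdbc8-21.7.0.0.jar",
    "https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.6/postgresql-42.2.6.jar",
]
bucket_name = GCS_STAGING_LOCATION.replace("gs://", "").split("/")[0]
prefix = GCS_STAGING_LOCATION.replace(f"gs://{bucket_name}", "").strip("/")
client = storage.Client(project=PROJECT)
bucket = client.bucket(bucket_name)
for url in jar_urls:
    file_name = url.split("/")[-1]
    urllib.request.urlretrieve(url, file_name)
    bucket.blob(f"{prefix}/jars/{file_name}" if prefix else f"jars/{file_name}").upload_from_filename(file_name)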

Step 8: Calculate Parallel Jobs for Oracle to PostgreSQL

Based on the MAX_PARALLELISM parameter set in step 3.1, this step calculates the parallel jobs for the Oracle to PostgreSQL migration, i.e. the number of Oracle tables that will be migrated in parallel to the PostgreSQL instance.
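Conceptually this is just splitting the table list into batches of at most MAX_PARALLELISM tables, as in the sketch below:

# Split the table list into batches; each batch is migrated as one set of parallel jobs
COMPLETE_LIST = copy.deepcopy(ORACLETABLE_LIST)
PARALLEL_JOBS = math.ceil(len(COMPLETE_LIST) / MAX_PARALLELISM)
JOB_LIST = [
    COMPLETE_LIST[i : i + MAX_PARALLELISM]
    for i in range(0, len(COMPLETE_LIST), MAX_PARALLELISM)
]
print(f"{PARALLEL_JOBS} batch(es): {JOB_LIST}")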

Step 9: Execute Pipeline to Migrate tables from Oracle to PostgreSQL

This step starts the Vertex AI pipeline execution for the Oracle table migration. A pipeline link is generated for each job. We can also see the jobs running in the Dataproc UI, in the Batches section.
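For orientation, here is a heavily simplified sketch of what such a pipeline submission can look like with the DataprocPySparkBatchOp component imported in step 2. The main_python_file_uri and the args list are placeholders, not the notebook's real template entry point or arguments; see the notebook for the actual Dataproc Templates job configuration.

aiplatform.init(project=PROJECT, location=REGION, staging_bucket=GCS_STAGING_LOCATION)

@dsl.pipeline(name="oracle-to-postgres-pyspark")
def migration_pipeline(batch_id: str, table: str):
    # One Dataproc Serverless batch per table (placeholder template arguments)
    DataprocPySparkBatchOp(
        project=PROJECT,
        location=REGION,
        batch_id=batch_id,
        main_python_file_uri=GCS_STAGING_LOCATION + "/main.py",  # assumed entry point
        jar_file_uris=JARS,
        subnetwork_uri=SUBNET,
        service_account=SERVICE_ACCOUNT,
        args=["--template=JDBCTOJDBC", f"--table={table}"],  # placeholder args
    )

compiler.Compiler().compile(pipeline_func=migration_pipeline, package_path="pipeline.json")

pipeline_jobs = []
for table in ORACLETABLE_LIST:
    # Dataproc batch IDs must be lowercase and dash-separated
    safe_name = table.lower().replace("_", "-")
    job = aiplatform.PipelineJob(
        display_name=f"oracle2pg-{safe_name}",
        template_path="pipeline.json",
        pipeline_root=GCS_STAGING_LOCATION,
        parameter_values={
            "batch_id": f"oracle2pg-{safe_name}-{datetime.now().strftime('%Y%m%d%H%M%S')}",
            "table": table,
        },
        enable_caching=False,
    )
    job.submit(service_account=SERVICE_ACCOUNT)
    pipeline_jobs.append(job)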

Step 10: Get status for tables migrated from Oracle to PostgreSQL

This step gets the status of the jobs that were executed in step 9.
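In its simplest form this is just polling the submitted pipeline jobs, reusing the pipeline_jobs list from the previous sketch:

# Print the current state of each submitted Vertex AI pipeline job
for job in pipeline_jobs:
    print(job.display_name, job.state)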

Step 11: Validate row counts of migrated tables from Oracle to PostgreSQL

This step helps validate the number of records of each table migrated from Oracle to PostgreSQL.
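A minimal validation sketch, comparing source and target counts; the PostgreSQL connection string and the lower-cased target table names are assumptions based on the step 3.3 parameters:

# Compare row counts between the Oracle source and the PostgreSQL target
postgres_url = f"postgresql+psycopg2://{POSTGRES_USERNAME}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DATABASE}"
postgres_engine = sqlalchemy.create_engine(postgres_url)

results = []
with oracle_engine.connect() as ora, postgres_engine.connect() as pg:
    for table in ORACLETABLE_LIST:
        ora_count = ora.execute(sqlalchemy.text(f"SELECT COUNT(*) FROM {table}")).scalar()
        pg_count = pg.execute(sqlalchemy.text(f"SELECT COUNT(*) FROM {table.lower()}")).scalar()
        results.append({"table": table, "oracle_rows": ora_count, "postgres_rows": pg_count, "match": ora_count == pg_count})
pd.DataFrame(results)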

→ For any questions or recommendations, get in touch with dataproc-templates-support-external@googlegroups.com.

Thanks for reading. Until next time, take care and stay connected :)
