Oracle to PostgreSQL: Migrate Oracle to PostgreSQL using Vertex AI notebooks and GCP Dataproc Serverless
Dataproc Templates, in conjunction with Vertex AI notebooks and Dataproc Serverless, provide a one-stop solution for migrating data directly from an Oracle database to any PostgreSQL database.
We have developed a Vertex AI (Jupyter) notebook solution that uses Dataproc Serverless for the migration. You only need to provide the necessary parameters, and the notebook will help you migrate a complete Oracle data warehouse to a Cloud SQL for PostgreSQL instance.
This notebook solution is built on top of Vertex AI Jupyter notebooks and Google Cloud's Dataproc tech stack.
Before we begin, here are some added benefits of this notebook:
- Automatically generates the list of tables from database metadata. Alternatively, the user can supply a list of tables.
- Identifies the current primary key column name and partitioned read properties for each table.
- Automatically uses partitioned reads when a table's row count exceeds a threshold.
- Divides the migration into batches and migrates multiple tables in parallel.
- Lets you choose the write mode, i.e. append or overwrite.
- Gives you the option to specify the PostgreSQL schema name.
Prerequisites:
1. Enable the below services in your GCP project from the API Console:
- Compute Engine API
- Dataproc API
- Vertex AI API
- Vertex Notebooks API
2. Create a User-Managed Notebook in Vertex AI Workbench.
3. Clone the Dataproc Templates repository using the Git tab as shown in the image below,
or open a terminal from the launcher window and clone it using git clone.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
4. From the folders tab, open OracleToPostgreSQl_notebook.ipynb present at the path: dataproc-templates/notebooks/oracle2postgres
5. The notebook includes a step-by-step procedure to assist with the data migration from Oracle to PostgreSQL. Execute each step one at a time; the notebook contains detailed instructions. The migration steps are listed below:
Step 1: Install required packages
Some packages needed for the migration are not available in the notebook environment and need to be installed separately, e.g. SQLAlchemy, JDK, etc.
!pip3 install SQLAlchemy
!pip3 install --upgrade google-cloud-pipeline-components kfp --user -q
!sudo apt-get install libpq-dev --yes
!pip3 install psycopg2
!pip3 install cx-Oracle
Oracle Instant Client installation
%%bash
sudo mkdir -p /opt/oracle
sudo rm -fr /opt/oracle/instantclient*
cd /opt/oracle
sudo wget --no-verbose https://download.oracle.com/otn_software/linux/instantclient/instantclient-basic-linuxx64.zip
sudo unzip instantclient-basic-linuxx64.zip
INSTANT_CLIENT_DIR=$(find /opt/oracle -maxdepth 1 -type d -name "instantclient_[0-9]*_[0-9]*" | sort | tail -1)
test -n "${INSTANT_CLIENT_DIR}" || echo "ERROR: Could not find instant client"
test -n "${INSTANT_CLIENT_DIR}" || exit 1
sudo apt-get install libaio1
sudo sh -c "echo ${INSTANT_CLIENT_DIR} > /etc/ld.so.conf.d/oracle-instantclient.conf"
sudo ldconfig
export LD_LIBRARY_PATH=${INSTANT_CLIENT_DIR}:$LD_LIBRARY_PATH
Step 2: Import Libraries: This notebook requires the Oracle driver and PostgreSQL connector JARs. Installation information is present in the notebook.
import sqlalchemy
import google.cloud.aiplatform as aiplatform
from kfp import dsl
from kfp.v2 import compiler
from datetime import datetime
import time
import copy
import json
import math
import pandas as pd
from google_cloud_pipeline_components.experimental.dataproc import DataprocPySparkBatchOp
Step 3: Assign Parameters: The below configurations need to be set before running the notebook:
Step 3.1: Common Parameters
PROJECT: GCP project id
REGION: GCP region
GCS_STAGING_LOCATION: GCS staging location used by this notebook to store artifacts
SUBNET: VPC subnet
JARS: list of JARs; for this notebook, the Oracle driver and PostgreSQL connector JAR along with the Dataproc Templates JARs
MAX_PARALLELISM: number of jobs to run in parallel; default value is 2
SERVICE_ACCOUNT: custom service account email to use for the Vertex AI pipeline and Dataproc jobs, with the permissions mentioned in the notebook
# Get GCP Project
PROJECT = "<project-id>"
REGION = "<region>"
GCS_STAGING_LOCATION = "<gs://bucket/[folder]>"
SUBNET = "<projects/{project}/regions/{region}/subnetworks/{subnet}>"
MAX_PARALLELISM = 5 # maximum number of tables that will be migrated in parallel
SERVICE_ACCOUNT = ""
# Do not change this parameter unless you want to refer below JARS from new location
JARS = [GCS_STAGING_LOCATION + "/jars/ojdbc8-21.7.0.0.jar", GCS_STAGING_LOCATION + "/jars/postgresql-42.2.6.jar"]
# If SERVICE_ACCOUNT is not specified it will take the one attached to Notebook
if SERVICE_ACCOUNT == '':
shell_output = !gcloud auth list 2>/dev/null
SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()
print("Service Account: ", SERVICE_ACCOUNT)
Step 3.2: Oracle Parameters
ORACLE_HOST: Oracle instance IP address
ORACLE_PORT: Oracle instance port
ORACLE_USERNAME: Oracle username
ORACLE_PASSWORD: Oracle password
ORACLE_DATABASE: name of the database/service for the Oracle connection
ORACLETABLE_LIST: list of tables you want to migrate, e.g. ['table1','table2']; provide an empty list, i.e. [], to migrate the whole database
ORACLE_HOST = "<host>"
ORACLE_PORT = "<port>"
ORACLE_USERNAME = "<username>"
ORACLE_PASSWORD = "<password>"
ORACLE_DATABASE = "<database>"
ORACLETABLE_LIST = [] # leave the list empty to migrate the complete database, else provide tables as ['table1','table2']
Step 3.3: PostgreSQL Parameters
POSTGRES_HOST: PostgreSQL instance IP address
POSTGRES_PORT: PostgreSQL instance port
POSTGRES_USERNAME: PostgreSQL username
POSTGRES_PASSWORD: PostgreSQL password
POSTGRES_DATABASE: name of the database for the PostgreSQL connection
OUTPUT_MODE: output write mode (one of: append, overwrite, ignore, errorifexists); defaults to overwrite
BATCH_SIZE: JDBC output batch size; defaults to 1000
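These parameters are assigned the same way as the Oracle block; a placeholder sketch (the values shown are illustrative defaults, not verbatim notebook code):

```python
POSTGRES_HOST = "<host>"
POSTGRES_PORT = "<port>"
POSTGRES_USERNAME = "<username>"
POSTGRES_PASSWORD = "<password>"
POSTGRES_DATABASE = "<database>"
OUTPUT_MODE = "overwrite"  # one of: append, overwrite, ignore, errorifexists
BATCH_SIZE = "1000"        # JDBC output batch size
```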
Step 4: Generate Oracle Table List
This step generates the list of tables for migration. All tables from the database are selected if the ORACLETABLE_LIST parameter in step 3.2 is an empty list.
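The selection logic can be sketched as follows. This is an illustrative helper, not the notebook's actual code; the fallback query against Oracle's user_tables metadata view is a common way to enumerate tables:

```python
# Query used when no explicit table list is supplied (user_tables is a
# standard Oracle data dictionary view).
TABLE_LIST_QUERY = "SELECT table_name FROM user_tables"

def resolve_table_list(user_list, fetch_all_tables):
    """Return user_list if non-empty, else the full list from metadata."""
    return list(user_list) if user_list else fetch_all_tables()

# Example: with an empty user list, the metadata query result is used.
tables = resolve_table_list([], lambda: ["EMPLOYEES", "ORDERS"])
```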
Step 5: Get Primary Keys for tables from Oracle Source
This step gets the primary keys for the tables present in the ORACLETABLE_LIST parameter from ORACLE_DATABASE.
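A minimal sketch of how primary-key columns can be looked up in Oracle's data dictionary; the notebook's actual query may differ, and the helper name is illustrative:

```python
def primary_key_query(owner):
    # all_constraints / all_cons_columns are standard Oracle dictionary
    # views; constraint_type 'P' marks primary keys.
    return (
        "SELECT cols.table_name, cols.column_name "
        "FROM all_constraints cons "
        "JOIN all_cons_columns cols "
        "  ON cons.constraint_name = cols.constraint_name "
        " AND cons.owner = cols.owner "
        "WHERE cons.constraint_type = 'P' "
        f"  AND cons.owner = '{owner.upper()}'"
    )
```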
Step 6: Get Row Count of Tables and identify Partition Columns
This step uses the PARTITION_THRESHOLD parameter (default value: 1 million); any table with more rows than PARTITION_THRESHOLD will be partitioned based on its primary key.
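The decision boils down to a simple comparison; a sketch with an illustrative function name:

```python
# Default threshold of 1 million rows, as described above.
PARTITION_THRESHOLD = 1_000_000

def needs_partitioned_read(row_count, threshold=PARTITION_THRESHOLD):
    # Tables above the threshold are read in partitions keyed on the
    # primary key; smaller tables are read in a single pass.
    return row_count > threshold
```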
Step 7: Download JAR files and Upload to GCS (only required to run one-time)
This step downloads the required JAR files and uploads them to the GCS_STAGING_LOCATION defined in step 3.1.
Step 8: Calculate Parallel Jobs for Oracle to PostgreSQL
Based on the MAX_PARALLELISM parameter set in step 3.1, this step calculates the parallel jobs for the Oracle to PostgreSQL migration, i.e. the number of Oracle tables that will be migrated in parallel to the PostgreSQL instance.
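The batching can be sketched as chunking the table list into groups of at most MAX_PARALLELISM tables, where each group runs concurrently (the helper is illustrative, not the notebook's code):

```python
def make_batches(tables, max_parallelism):
    # Split the table list into consecutive chunks of at most
    # max_parallelism tables; each chunk is migrated in parallel.
    return [tables[i:i + max_parallelism]
            for i in range(0, len(tables), max_parallelism)]

# Example: with MAX_PARALLELISM = 5, 7 tables yield 2 batches.
batches = make_batches(["t1", "t2", "t3", "t4", "t5", "t6", "t7"], 5)
```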
Step 9: Execute Pipeline to Migrate Tables from Oracle to PostgreSQL
This step starts the Vertex AI pipeline execution for the Oracle table migration. A pipeline link is generated for each job. You can also see the jobs running in the Dataproc UI, in the Batches section.
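Under the hood, each batch job passes per-table arguments to the Dataproc Templates JDBC-to-JDBC template. The sketch below shows roughly what that argument list looks like; the --templateProperty keys follow the repository's naming convention but should be treated as assumptions, so check the repository for the exact names:

```python
def template_args(table, input_url, output_url, output_mode="overwrite"):
    # Hypothetical argument list for one table's Dataproc Serverless batch;
    # the property keys below are assumptions, not verified values.
    return [
        "--template=JDBCTOJDBC",
        f"--templateProperty=jdbctojdbc.input.url={input_url}",
        f"--templateProperty=jdbctojdbc.input.table={table}",
        f"--templateProperty=jdbctojdbc.output.url={output_url}",
        f"--templateProperty=jdbctojdbc.output.table={table}",
        f"--templateProperty=jdbctojdbc.output.mode={output_mode}",
    ]

args = template_args("EMPLOYEES", "jdbc:oracle:thin:@//host:1521/db",
                     "jdbc:postgresql://host:5432/db")
```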
Step 10: Get status for tables migrated from Oracle to PostgreSQL
This step gets the status of the jobs executed in step 9.
Step 11: Validate row counts of migrated tables from Oracle to PostgreSQL
This step helps validate the number of records of each table migrated from Oracle to PostgreSQL.
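The validation idea is a per-table count comparison; a simple sketch with an illustrative helper (not the notebook's code):

```python
def find_count_mismatches(source_counts, target_counts):
    # Compare per-table row counts from source and target; return only the
    # tables whose counts disagree, with (source, target) pairs.
    return {
        table: (src, target_counts.get(table))
        for table, src in source_counts.items()
        if target_counts.get(table) != src
    }

# Example: a table that came up short on the target side is flagged.
mismatches = find_count_mismatches({"EMP": 100, "DEPT": 10},
                                   {"EMP": 100, "DEPT": 8})
```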
→ For any questions or recommendations, get in touch with dataproc-templates-support-external@googlegroups.com.
Thanks for reading. Until next time, take care and stay connected :)