Migrate Oracle to BigQuery using Dataproc and Sqoop

Samet Karadag
Google Cloud - Community
Aug 26, 2020

Want to migrate an Oracle database to BigQuery? Not sure how to move the data from Oracle to BigQuery? Want to do it with GCP-native tools?

Keep reading: this is the right post for you!

Here are some advantages of using Dataproc and Sqoop for your Oracle migrations:

  1. A Dataproc cluster is ready to use in a couple of minutes.
  2. Dataproc has built-in GCS connectors, so the data lands in GCS ready to be consumed by BigQuery. Just use "gs://" instead of "hdfs://".
  3. Sqoop supports many relational databases as well as mainframes.
  4. Sqoop has great scalability features, such as configurable splits and parallelism (see the sketch after this list).
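As a rough sketch of that last point (the flag values below are illustrative, not from this post), Sqoop's --num-mappers and --split-by options control how the source table is split into ranges that are extracted in parallel:

# Illustrative only: extract HR.EMPLOYEES with 8 parallel mappers,
# splitting the work on the numeric primary key column EMPLOYEE_ID.
sqoop import \
  --connect jdbc:oracle:thin:<username>/<password>@<IP>:1521/<SERVICE> \
  --table HR.EMPLOYEES \
  --split-by EMPLOYEE_ID \
  --num-mappers 8 \
  --target-dir gs://my-oracle-extract/EMPLOYEES \
  --as-textfile

More mappers mean more concurrent connections to Oracle, so keep the parallelism proportional to what the source database and the network can handle.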

Cloud Composer or Cloud Data Fusion can also be used for Oracle to BigQuery migrations. I will cover those in upcoming blog posts.

1. Create Dataproc cluster

First, create a Dataproc cluster. A single-node cluster is enough; we do not need many nodes since we are not doing big data analysis here.

Performance tip: The bottleneck will most likely be network bandwidth or the Oracle server itself (CPU/IO). The number of CPUs in the Dataproc cluster will probably not matter much, but it should be proportional to the desired parallelism. Choose the region for the Dataproc cluster, GCS buckets, and BQ dataset carefully; it should be close to your Oracle database location to keep network latency low.

HA tip: If this migration is not production critical (i.e., some downtime is acceptable), you do not need HA (3 master nodes).

Billing tip: E2 instances are cheaper than N2 instances, and we do not need any capabilities that E2 lacks, so E2 is fine for this job.

Sizing tip: Data will be written to GCS, so HDFS size does not matter.

gcloud dataproc clusters create sqoop-cluster \
  --enable-component-gateway \
  --region europe-west4 \
  --subnet default \
  --zone europe-west4-b \
  --single-node \
  --master-machine-type e2-standard-8 \
  --master-boot-disk-size 500 \
  --image-version 1.3-debian10 \
  --project <project-id>

2. Get Oracle JDBC driver

Download the Oracle JDBC driver from Oracle's website.

For the purposes of this tutorial, ojdbc8.jar (or ojdbc7.jar) will do.

3. Upload Oracle JDBC driver to GCS

Create a bucket in GCS and upload the driver to it.

From Cloud Shell (or your own terminal if you have gsutil installed):

gsutil mb -l europe-west4 gs://my-sqoop-jar-bucket
wget https://download.oracle.com/otn-pub/otn_software/jdbc/ojdbc8.jar
gsutil cp ojdbc8.jar gs://my-sqoop-jar-bucket/

4. Download and upload sqoop jars

Sqoop can also be installed at cluster provisioning time using an initialization action (see the sketch below). For the purposes of this tutorial, however, we will manually download the Sqoop jars and upload them to GCS.
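If you prefer the init-action route, here is a minimal sketch; the install script path below is hypothetical (the original post links to a specific init action), so substitute your own:

# Hypothetical path: install-sqoop.sh is a script you provide that downloads
# and unpacks Sqoop on the cluster node(s) at provisioning time.
gcloud dataproc clusters create sqoop-cluster \
  --single-node \
  --region europe-west4 \
  --initialization-actions gs://my-sqoop-jar-bucket/install-sqoop.sh \
  --project <project-id>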

Download the Sqoop jars from the Apache Sqoop downloads page. Choose sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz from any mirror, e.g. http://apache.cs.uu.nl/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

BUCKET=gs://my-sqoop-jar-bucket
wget https://apache.belnet.be/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
tar xvfz sqoop*
gsutil cp sqoop-1.4.7.bin__hadoop-2.6.0/sqoop-1.4.7.jar $BUCKET/
gsutil cp sqoop-1.4.7.bin__hadoop-2.6.0/lib/* $BUCKET/

5. Extract and load a table from your Oracle database into GCS

We will use a delimited (text) file format, since the Avro format may have issues with DATE and NUMBER columns when loading into BigQuery.

Create a GCS bucket to store the extract files, and submit your Sqoop job:

gsutil mb -l europe-west4 gs://my-oracle-extract

JAR_BUCKET=gs://my-sqoop-jar-bucket
STAGING_BUCKET=gs://my-oracle-extract
libs=`gsutil ls $JAR_BUCKET | paste -sd, -`
JDBC_STR=jdbc:oracle:thin:<username>/<password>@<IP>:1521/<SERVICE>
TABLE=EMPLOYEES
SCHEMA=HR

gcloud dataproc jobs submit hadoop \
  --cluster=sqoop-cluster \
  --region=europe-west4 \
  --class=org.apache.sqoop.Sqoop \
  --jars=$libs \
  -- import \
  -Dmapreduce.job.user.classpath.first=true \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect=$JDBC_STR \
  --target-dir=$STAGING_BUCKET/$TABLE \
  --table=$SCHEMA.$TABLE \
  --enclosed-by '\"' \
  --escaped-by \" \
  --fields-terminated-by '|' \
  --null-string '' \
  --null-non-string '' \
  --as-textfile

Sqoop has a bug that produces duplicated double quotes when the " character is used both as the enclosing and the escape character, and " is required as the escape character for quote-enclosed fields in BigQuery. A workaround is to replace the doubled double quotes with single double quotes via an UPDATE in BigQuery after the load. Also note that Sqoop extracts Oracle DATE columns as timestamps, so the target BigQuery table should use TIMESTAMP columns instead of DATE; after the initial load they can easily be converted to DATE with a CTAS (see the sketch below).
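As a minimal sketch of that cleanup (the dataset, table, and column names are illustrative, assuming the HR.EMPLOYEES example with an affected string column LAST_NAME and a HIRE_DATE column loaded as TIMESTAMP):

# Collapse the duplicated double quotes produced by the Sqoop bug.
bq query --use_legacy_sql=false '
  UPDATE HR.EMPLOYEES
  SET LAST_NAME = REPLACE(LAST_NAME, "\"\"", "\"")
  WHERE LAST_NAME LIKE "%\"\"%"'

# Convert the TIMESTAMP column back to DATE with a CTAS into a new table.
bq query --use_legacy_sql=false '
  CREATE TABLE HR.EMPLOYEES_DATES AS
  SELECT * REPLACE (DATE(HIRE_DATE) AS HIRE_DATE)
  FROM HR.EMPLOYEES'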

6. Create dataset and table in BigQuery

Create a dataset if you have not done so already.

Create the table in BigQuery. You need to map Oracle NUMBER columns to NUMERIC, DATE columns to TIMESTAMP (as noted above), and other Oracle types to their corresponding BQ types (see the sketch below).
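Here is a minimal sketch for the HR.EMPLOYEES example; the column list assumes the standard Oracle HR sample schema and the dataset name is illustrative, so adapt both to your environment:

# Create the dataset (if you have not already) in the same region as the buckets.
bq --location=europe-west4 mk --dataset HR

# Oracle NUMBER -> NUMERIC, VARCHAR2 -> STRING, DATE -> TIMESTAMP (see note above).
bq query --use_legacy_sql=false '
  CREATE TABLE HR.EMPLOYEES (
    EMPLOYEE_ID    NUMERIC,
    FIRST_NAME     STRING,
    LAST_NAME      STRING,
    EMAIL          STRING,
    PHONE_NUMBER   STRING,
    HIRE_DATE      TIMESTAMP,
    JOB_ID         STRING,
    SALARY         NUMERIC,
    COMMISSION_PCT NUMERIC,
    MANAGER_ID     NUMERIC,
    DEPARTMENT_ID  NUMERIC
  )'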

Good news: an automatic Oracle to BQ schema converter will be the topic of the next blog post, so stay tuned.

7. Load data from GCS to BigQuery

Load the BigQuery table from GCS:

bq load --source_format=CSV --allow_quoted_newlines --field_delimiter='|' BQ_DATASET.$TABLE "$STAGING_BUCKET/$TABLE/part-*"

8. Export all tables within a schema

gcloud dataproc jobs submit hadoop \
  --cluster=sqoop-cluster \
  --region=europe-west4 \
  --class=org.apache.sqoop.Sqoop \
  --jars=$libs \
  -- import-all-tables \
  -Dmapreduce.job.user.classpath.first=true \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect=$JDBC_STR \
  --warehouse-dir=$STAGING_BUCKET/$SCHEMA \
  --enclosed-by '\"' \
  --escaped-by \" \
  --fields-terminated-by '|' \
  --null-string '' \
  --null-non-string '' \
  --as-textfile

9. Import all tables into BQ

First, you need to create all the tables in BQ (good news: that will be covered in the next blog post). Then you can use the shell script below to load all the tables automatically.

#!/bin/bash
# Load every extracted table under gs://my-oracle-extract/ into BigQuery.
# Optional arguments filter by dataset and/or table name.
FILTER_DS=$1
FILTER_TABLE=$2

mkdir -p .schemas

# gsutil ls -r prints each directory as "gs://bucket/schema/table/:",
# so keeping only the lines ending with ":" gives one entry per table.
files=`gsutil ls -r gs://my-oracle-extract/ | grep "\:$"`

for f in $files
do
  dataset=`echo $f | cut -d/ -f4`
  dataset="${dataset//c_lew/p_lew}"   # environment-specific dataset rename; adjust or drop for your setup
  table=`echo $f | cut -d/ -f5`
  colon=`echo $f | cut -d/ -f6`

  if [[ ! -z $FILTER_DS ]] && [[ $FILTER_DS != $dataset ]] ; then
    echo "skipping " $dataset
    continue
  fi

  if [[ ! -z $FILTER_TABLE ]] && [[ $FILTER_TABLE != $table ]] ; then
    echo "skipping " $table
    continue
  fi

  if [[ ":" == "$colon" ]] ; then
    # Dump the existing table schema; if the table does not exist in BQ,
    # this pipeline fails and the table is skipped.
    bq show --format=prettyjson $dataset.$table | jq '.schema.fields' > .schemas/$dataset.$table.json 2>/dev/null
    OUT=$?
    if [ $OUT -eq 0 ]; then
      f_with_wildcard="${f//\/\://*}"   # turn ".../table/:" into the wildcard ".../table/*"
      echo "Loading $dataset.$table"
      bq load --field_delimiter='|' --source_format=CSV $dataset.$table $f_with_wildcard .schemas/$dataset.$table.json
    else
      echo "No table definition for $dataset.$table"
    fi
  fi
done
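For example, assuming you save the script as load_all_tables.sh (the file name is mine, not from the post), you can load everything or filter down to a single schema or table:

bash load_all_tables.sh               # load every extracted table
bash load_all_tables.sh HR            # load only tables under the HR dataset
bash load_all_tables.sh HR EMPLOYEES  # load only HR.EMPLOYEES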

I hope this article was useful for you; I would appreciate your comments and feedback!
