[Python] Fast export of large database tables using GCP Serverless Dataproc

Import large tables from any JDBC source (MySQL, PostgreSQL, MSSQL) to BigQuery

Rafael Silva
Google Cloud - Community
3 min read · Sep 15, 2022


Dataproc Serverless is a great addition to the Google Cloud Dataproc platform. It allows users to run Spark workloads without provisioning or managing clusters; Dataproc Serverless manages all the required infrastructure behind the scenes.

Dataproc Templates cover common use cases for these kinds of workloads, so we don't need to develop them ourselves. These templates also let us customize and run them quickly.

Introduction

If you need to import/export large tables (hundreds of GBs to TBs) fast, moving data in multiple threads in parallel, and using a robust, proven, open-source and hardened mechanism, this post may help you.

Let’s use the JDBCToBigQuery template to export tables in a fast, efficient and multi-threaded fashion.

Requirements

  • Use any machine with Python 3.7+, Git and the gcloud CLI pre-installed.
    Alternatively, use Cloud Shell, which has those tools pre-installed.
  • A VPC subnet with Private Google Access enabled. The default subnet is suitable, as long as Private Google Access is enabled. You can review all the Dataproc Serverless networking requirements here.
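
If you are not sure whether your subnet already has Private Google Access, a quick check and enable with gcloud looks like this (the subnet name and region below are placeholders for your own environment):

    # Check whether Private Google Access is enabled on the subnet
    gcloud compute networks subnets describe default \
        --region=us-central1 \
        --format="get(privateIpGoogleAccess)"

    # Enable it if the command above prints False
    gcloud compute networks subnets update default \
        --region=us-central1 \
        --enable-private-ip-google-access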

Simple Usage

This approach is not multi-threaded, so it works fine for tables smaller than 1 GB in size.

  1. [Recommended] Clone your active instance, or create a read replica. Pausing writes to the source database is recommended for consistency purposes.
  2. Make sure your database is reachable from the VPC network. If using a public database, make sure to enable Cloud NAT. Please refer to this for further information.
  3. Create a GCS bucket and staging location for your jar files. Download the JDBC driver jar for the respective source database, and the BigQuery connector for Spark. Upload those jar files into the GCS bucket (see the sketch after this list).
  4. Clone the Dataproc Templates git repo:
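
A minimal sketch of steps 3 and 4, assuming a MySQL source; the bucket name, region and jar versions are placeholders for your own values:

    # Step 3: create a staging bucket and upload the connector jars (names are illustrative)
    gsutil mb -l us-central1 gs://my-staging-bucket
    gsutil cp mysql-connector-java-8.0.29.jar gs://my-staging-bucket/jars/
    gsutil cp spark-bigquery-with-dependencies_2.12-0.26.0.jar gs://my-staging-bucket/jars/

    # Step 4: clone the Dataproc Templates repo and switch to the Python templates
    git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
    cd dataproc-templates/python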

5. Obtain authentication credentials:
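
For example, Application Default Credentials from the gcloud CLI are enough for the template submission:

    # Authenticate so the job submission can call Google Cloud APIs on your behalf
    gcloud auth application-default login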

6. Execute the template; refer to the JDBCToBigQuery documentation for more details. Replace the environment values to match your case (GCP project, region, JDBC URL, path of the jars, etc.).
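
A sketch of what the submission can look like from the python/ directory of the repo. The project, region, subnet, bucket and JDBC URL are placeholders, and the template property names shown here are assumptions on my part; verify them against the JDBCToBigQuery documentation for your version of the templates:

    # Environment read by bin/start.sh (values are placeholders)
    export GCP_PROJECT=my-project
    export REGION=us-central1
    export SUBNET=projects/my-project/regions/us-central1/subnetworks/default
    export GCS_STAGING_LOCATION=gs://my-staging-bucket/staging
    export JARS="gs://my-staging-bucket/jars/mysql-connector-java-8.0.29.jar,gs://my-staging-bucket/jars/spark-bigquery-with-dependencies_2.12-0.26.0.jar"

    # Submit the JDBCTOBIGQUERY template (property names are illustrative,
    # check the template's README for the exact ones)
    ./bin/start.sh -- --template=JDBCTOBIGQUERY \
        --jdbctobq.input.url="jdbc:mysql://10.0.0.5:3306/mydb?user=dbuser&password=dbpass" \
        --jdbctobq.input.driver="com.mysql.cj.jdbc.Driver" \
        --jdbctobq.input.table="employee" \
        --jdbctobq.output.mode="overwrite" \
        --jdbctobq.output.dataset="my_dataset" \
        --jdbctobq.output.table="employee" \
        --jdbctobq.temp.bucket.name="my-staging-bucket"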

NOTE: It will ask you to enable the Dataproc API, if not already enabled.

Advanced Usage (multi-threaded export/import)

Assuming you have an Employee table in a MySQL database with a schema as below:
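
The original schema embed is not reproduced here; an illustrative stand-in (column names and types are my assumptions) would be a table with a numeric, indexed id to partition on:

    # Illustrative only: an employee table with a numeric primary key suitable for partitioning
    mysql -h 10.0.0.5 -u dbuser -p mydb -e "
    CREATE TABLE employee (
      id         BIGINT       NOT NULL PRIMARY KEY,  -- ids run up to ~100 million
      name       VARCHAR(100) NOT NULL,
      department VARCHAR(50),
      salary     DECIMAL(10,2),
      hired_on   DATE
    );"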

Assume the maximum employee id is 100 million (this is used for the upperBound parameter).

Perform steps 1–4 as described in the previous section.
Then change step 6 by specifying the partition properties.

Execute the Spark job with the partition parameters; an example is below:
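
A sketch of the same submission with partitioning added; the property names are again assumptions mapping onto Spark's standard JDBC read options (partitionColumn, lowerBound, upperBound, numPartitions), so check the template documentation for the exact spelling:

    # Same environment as before; only the partitioning properties are new
    ./bin/start.sh -- --template=JDBCTOBIGQUERY \
        --jdbctobq.input.url="jdbc:mysql://10.0.0.5:3306/mydb?user=dbuser&password=dbpass" \
        --jdbctobq.input.driver="com.mysql.cj.jdbc.Driver" \
        --jdbctobq.input.table="employee" \
        --jdbctobq.input.partitioncolumn="id" \
        --jdbctobq.input.lowerbound="1" \
        --jdbctobq.input.upperbound="100000000" \
        --jdbctobq.numpartitions="200" \
        --jdbctobq.output.mode="overwrite" \
        --jdbctobq.output.dataset="my_dataset" \
        --jdbctobq.output.table="employee" \
        --jdbctobq.temp.bucket.name="my-staging-bucket"

With these options set, Spark opens numPartitions parallel JDBC connections, each reading its own slice of the id range between lowerBound and upperBound, instead of performing a single-threaded full-table scan.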

Other Targets

  1. Another database
    Spark JDBC natively supports the following databases: MySQL/MariaDB, PostgreSQL, DB2 and Oracle. Using the GCSToJDBC template (blog post), you can ingest data into any of them.
  2. Running JDBCToBigQuery from a Java Environment.

References
