[Python] Fast export of large database tables using GCP Serverless Dataproc

Import large tables from any JDBC source (MySQL, PostgreSQL, MSSQL) to BigQuery

Rafael Silva
Google Cloud - Community
3 min read · Sep 15, 2022


Dataproc Serverless is a great addition to the Google Cloud Dataproc platform. It allows users to run Spark workloads without provisioning or managing clusters; Dataproc Serverless manages all the required infrastructure behind the scenes.

Dataproc Templates cover common use cases for these kinds of workloads, so we don't need to develop them ourselves. These templates also let us customize and run them quickly.

Introduction

If you need to import/export large tables (hundreds of GBs to TBs) fast, moving data in multiple threads in parallel, and using a robust, proven, open-source and hardened mechanism, this post may help you.

Let’s use the JDBCToBigQuery template to export tables in a fast, efficient and multi-threaded fashion.

Requirements

  • Use any machine with Python 3.7+, Git and the gcloud CLI pre-installed.
    Alternatively, use Cloud Shell, which has those tools pre-installed.
  • A VPC subnet with Private Google Access enabled. The default subnet is suitable, as long as Private Google Access is enabled. You can review all the Dataproc Serverless networking requirements here.
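
If you are not sure whether your subnet already has Private Google Access, a quick check and enable with gcloud looks like this (the subnet name and region below are placeholders for your own environment):

    # Check whether Private Google Access is enabled on the subnet
    gcloud compute networks subnets describe default \
        --region=us-central1 \
        --format="get(privateIpGoogleAccess)"

    # Enable it if the command above prints False
    gcloud compute networks subnets update default \
        --region=us-central1 \
        --enable-private-ip-google-access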

Simple Usage

This approach is not multi-threaded, so it works fine for tables smaller than 1 GB in size.

  1. [Recommended] Clone your active instance, or create a read replica. Pausing writes to the source database is recommended for consistency purposes.
  2. Make sure your database is reachable from the VPC network. If using a public database, make sure to enable Cloud NAT. Please refer to this for further information.
  3. Create a GCS bucket and staging location for your jar files. Download the JDBC driver jar for the respective source database, and the BigQuery connector for Spark. Upload those jar files into the GCS bucket (see the sketch after this list).
  4. Clone the Dataproc Templates git repo:
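
A minimal sketch of steps 3 and 4, assuming a MySQL source; the bucket name, region and jar versions are placeholders for your own values:

    # Step 3: create a staging bucket and upload the connector jars (names are illustrative)
    gsutil mb -l us-central1 gs://my-staging-bucket
    gsutil cp mysql-connector-java-8.0.29.jar gs://my-staging-bucket/jars/
    gsutil cp spark-bigquery-with-dependencies_2.12-0.26.0.jar gs://my-staging-bucket/jars/

    # Step 4: clone the Dataproc Templates repo and switch to the Python templates
    git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
    cd dataproc-templates/python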

5. Obtain authentication credentials:
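
For example, Application Default Credentials from the gcloud CLI are enough for the template submission:

    # Authenticate so the job submission can call Google Cloud APIs on your behalf
    gcloud auth application-default login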

6. Execute the template; refer to the JDBCToBigQuery documentation for more details. Replace the environment values to match your case (GCP project, region, JDBC URL, path of the jars, etc.).
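
A sketch of what the submission can look like from the python/ directory of the repo. The project, region, subnet, bucket and JDBC URL are placeholders, and the template property names shown here are assumptions on my part; verify them against the JDBCToBigQuery documentation for your version of the templates:

    # Environment read by bin/start.sh (values are placeholders)
    export GCP_PROJECT=my-project
    export REGION=us-central1
    export SUBNET=projects/my-project/regions/us-central1/subnetworks/default
    export GCS_STAGING_LOCATION=gs://my-staging-bucket/staging
    export JARS="gs://my-staging-bucket/jars/mysql-connector-java-8.0.29.jar,gs://my-staging-bucket/jars/spark-bigquery-with-dependencies_2.12-0.26.0.jar"

    # Submit the JDBCTOBIGQUERY template (property names are illustrative,
    # check the template's README for the exact ones)
    ./bin/start.sh -- --template=JDBCTOBIGQUERY \
        --jdbctobq.input.url="jdbc:mysql://10.0.0.5:3306/mydb?user=dbuser&password=dbpass" \
        --jdbctobq.input.driver="com.mysql.cj.jdbc.Driver" \
        --jdbctobq.input.table="employee" \
        --jdbctobq.output.mode="overwrite" \
        --jdbctobq.output.dataset="my_dataset" \
        --jdbctobq.output.table="employee" \
        --jdbctobq.temp.bucket.name="my-staging-bucket"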

NOTE: It will ask you to enable the Dataproc API, if not already enabled.

Advanced Usage (multi-threaded export/import)

Assuming you have an Employee table in a MySQL database with a schema as below:
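
The original schema embed is not reproduced here; an illustrative stand-in (column names and types are my assumptions) would be a table with a numeric, indexed id to partition on:

    # Illustrative only: an employee table with a numeric primary key suitable for partitioning
    mysql -h 10.0.0.5 -u dbuser -p mydb -e "
    CREATE TABLE employee (
      id         BIGINT       NOT NULL PRIMARY KEY,  -- ids run up to ~100 million
      name       VARCHAR(100) NOT NULL,
      department VARCHAR(50),
      salary     DECIMAL(10,2),
      hired_on   DATE
    );"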

Assume the maximum employee id is 100 million (this is used for the upperBound parameter).

Perform steps 1–4 as described in the previous section.
Then change step 6 by specifying the partition properties.

Execute the Spark job with the partition parameters; an example is below:
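
A sketch of the same submission with partitioning added; the property names are again assumptions mapping onto Spark's standard JDBC read options (partitionColumn, lowerBound, upperBound, numPartitions), so check the template documentation for the exact spelling:

    # Same environment as before; only the partitioning properties are new
    ./bin/start.sh -- --template=JDBCTOBIGQUERY \
        --jdbctobq.input.url="jdbc:mysql://10.0.0.5:3306/mydb?user=dbuser&password=dbpass" \
        --jdbctobq.input.driver="com.mysql.cj.jdbc.Driver" \
        --jdbctobq.input.table="employee" \
        --jdbctobq.input.partitioncolumn="id" \
        --jdbctobq.input.lowerbound="1" \
        --jdbctobq.input.upperbound="100000000" \
        --jdbctobq.numpartitions="200" \
        --jdbctobq.output.mode="overwrite" \
        --jdbctobq.output.dataset="my_dataset" \
        --jdbctobq.output.table="employee" \
        --jdbctobq.temp.bucket.name="my-staging-bucket"

With these options set, Spark opens numPartitions parallel JDBC connections, each reading its own slice of the id range between lowerBound and upperBound, instead of performing a single-threaded full-table scan.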

Other Targets

  1. Another database
    Spark JDBC natively supports the following databases: MySQL/MariaDB, PostgreSQL, DB2 and Oracle. Using the GCSToJDBC template (blog post), you can ingest data into any of them.
  2. Running JDBCToBigQuery from a Java Environment.

References
