Loading MySQL backup files into BigQuery — straight out of Cloud SQL

Reading MySQL backup files into BigQuery is not always easy. Now you can load these backups into a Cloud SQL MySQL instance — and then have BigQuery read straight out of MySQL.

Felipe Hoffa
Sep 9

The problem

Let’s say you need to load some existing MySQL backups into BigQuery, like the MySQL dumps that GHTorrent publishes. But when you try, you find that it’s not straightforward:

bq load --autodetect temp.users1 gs://fh/users.csv

BigQuery error in load operation: Error processing job: Error while reading data, error message: CSV table encountered too many errors, giving up.
Failure details:
- gs://fh/users.csv: Error while reading data, error message: Error detected while parsing row starting at position: 1558192614. Error: Data between close double quote (") and field separator.

The solution: Cloud SQL as a BigQuery federated source

Now that BigQuery can read tables straight out of Cloud SQL instances (MySQL and PostgreSQL), we can just load our MySQL backups into Cloud SQL, and then:

CREATE TABLE temp.users AS
SELECT * FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT * FROM users;')
# This statement created a new table named fh-bigquery:temp.users.

Step by step

Set up a MySQL instance in Cloud SQL

It’s pretty straightforward to create a MySQL instance:

https://console.cloud.google.com/sql/create-instance-mysql
[Screenshots: creating a user for BigQuery, and creating the ghtorrent database.]
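If you’d rather script these steps, a rough gcloud equivalent of the console setup might look like this (the tier, region, user name, and password below are illustrative, not from the original post):

# Create the Cloud SQL MySQL instance (tier and region are examples)
gcloud sql instances create mycloudsql \
    --database-version=MYSQL_5_7 \
    --tier=db-n1-standard-1 \
    --region=us-central1

# Create the ghtorrent database (the SQL below also recreates it) and a user for BigQuery
gcloud sql databases create ghtorrent --instance=mycloudsql
gcloud sql users create bigquery --instance=mycloudsql --password=choose-a-password

Once the instance is up, connect to it and create the tables: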
gcloud sql connect mycloudsql --user=root --quiet
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='TRADITIONAL,ALLOW_INVALID_DATES';
SET @OLD_TIME_ZONE=@@session.time_zone;
DROP SCHEMA IF EXISTS `ghtorrent` ;
CREATE SCHEMA IF NOT EXISTS `ghtorrent` DEFAULT CHARACTER SET utf8 ;
USE `ghtorrent` ;
-- -----------------------------------------------------
-- Table `ghtorrent`.`users`
-- -----------------------------------------------------
DROP TABLE IF EXISTS `ghtorrent`.`users` ;
CREATE TABLE IF NOT EXISTS `ghtorrent`.`users` (
`id` INT(11) NOT NULL AUTO_INCREMENT COMMENT '',
`login` VARCHAR(255) NOT NULL COMMENT '',
`company` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
`created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '',
`type` VARCHAR(255) NOT NULL DEFAULT 'USR' COMMENT '',
`fake` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`deleted` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`long` DECIMAL(11,8) COMMENT '',
`lat` DECIMAL(10,8) COMMENT '',
`country_code` CHAR(3) COMMENT '',
`state` VARCHAR(255) COMMENT '',
`city` VARCHAR(255) COMMENT '',
`location` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
PRIMARY KEY (`id`) COMMENT '')
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;

Load your backups into CloudSQL

Once you’ve created the database and set up the tables, loading an existing file from GCS into Cloud SQL is a one-step operation:

gcloud sql import csv \
mycloudsql \
gs://fh-bigquery/ghtorrent/mysql-2019-02-01/users.csv \
--database=ghtorrent --table=users;
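To sanity-check the import before wiring up BigQuery, you can connect again with gcloud sql connect mycloudsql --user=root --quiet and count the rows, something like:

-- Quick, optional check from the MySQL prompt
SELECT COUNT(*) FROM ghtorrent.users;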

Connect BigQuery to CloudSQL

In the new BigQuery web UI you can create a connection to the CloudSQL instance we just created:

[Screenshot: connecting BigQuery to our Cloud SQL instance.]
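If you prefer the command line over the web UI, the same connection can also be created with bq; the project, region, connection id, and credentials below are illustrative:

# Create a BigQuery connection pointing at the Cloud SQL instance
bq mk --connection \
    --connection_type='CLOUD_SQL' \
    --properties='{"instanceId":"fh-bigquery:us-central1:mycloudsql","database":"ghtorrent","type":"MYSQL"}' \
    --connection_credential='{"username":"bigquery","password":"choose-a-password"}' \
    --project_id=fh-bigquery --location=US \
    mycloudsql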

Now just read into BigQuery

Everything is in its place, so let’s bring the data in:

CREATE TABLE temp.users AS
SELECT * FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT * FROM users;')
But this time the read fails with an error coming from MySQL:

Error while reading data, error message: Failed to convert column created_at of 867-th row from MySQL to BigQuery. Error: Returns non-TIMESTAMP value: 0000-00-00 00:00:00, TIMESTAMP must be in format: 'YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]][time zone]', and range between 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999. You can use MySQL formatting functions to convert TIMESTAMP to BigQuery format.

Reading into BigQuery with some SQL magic

The query that fixes the import process:

CREATE TABLE temp.users AS
SELECT * EXCEPT(x) REPLACE(x AS created_at)
FROM EXTERNAL_QUERY(
'fh-bigquery.US.ghtorrent'
, 'SELECT *, IF(created_at="0000-00-00 00:00:00",null,created_at) x FROM users')
# Query complete (5 min 58 sec elapsed, 2.6 GB processed)
  • SELECT * EXCEPT(x) REPLACE(x AS created_at) is executed inside BigQuery. REPLACE() swaps the faulty created_at column for the fixed value x, and EXCEPT() makes sure we don’t also import x as an extra column.

Implementation notes

  • The final query took 6 minutes and processed 2.6 GB. While testing, you can add a LIMIT inside the MySQL query to bring these numbers down substantially (to around 3 seconds and 1 KB in the example I tried); see the sketch after this list.
  • Note that we didn’t need to set up service accounts or firewall rules anywhere. This is all handled transparently between Cloud SQL and BigQuery.
  • gcloud makes it easy to load data into MySQL tables, but you still need to manually create those tables as shown above.
  • You can stop and restart your Cloud SQL instances as needed (to save costs).
  • I love that BigQuery is able to figure out the schema automatically from a MySQL SELECT *.
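For example, to test the plumbing without pulling the whole table, you can push the LIMIT down into the inner MySQL query (connection name as created above):

-- Only pulls a handful of rows; the LIMIT runs inside MySQL
SELECT * FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT * FROM users LIMIT 10;')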

On loading MySQL backups straight into BigQuery

Why wasn’t it straightforward to load the .CSV files produced by MySQL into BigQuery? It turns out MySQL’s default CSV export is not 100% CSV compliant, because of how it escapes quote marks inside quoted fields.
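For example, a quote inside a quoted field comes out of a MySQL export escaped with a backslash, while standard (RFC 4180) CSV, which BigQuery’s loader expects, doubles it (illustrative):

MySQL export:   "dealing here, with \" a quoted quote"
RFC 4180 CSV:   "dealing here, with "" a quoted quote"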

Alternative: Parse with a UDF

You could also ingest the rows “raw” and parse the CSV inside BigQuery with a persistent UDF. See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83:

SELECT csv.cols[SAFE_OFFSET(0)] a
  , csv.cols[SAFE_OFFSET(1)] b
  , csv.cols[SAFE_OFFSET(2)] c
  , csv.cols[SAFE_OFFSET(3)] d
  , csv.cols[SAFE_OFFSET(4)] e
  , csv.cols[SAFE_OFFSET(5)] f
FROM (
  SELECT fhoffa.x.parse_csv('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
)
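As a rough sketch of that approach, assuming the raw lines were first loaded into a one-column staging table (here temp.users_raw with a single STRING column named line; both names are hypothetical), the parsing step could look like:

-- Parse each raw CSV line with the persistent UDF and name the columns
CREATE TABLE temp.users_parsed AS
SELECT csv.cols[SAFE_OFFSET(0)] id
  , csv.cols[SAFE_OFFSET(1)] login
  , csv.cols[SAFE_OFFSET(2)] company
FROM (
  SELECT fhoffa.x.parse_csv(line) csv
  FROM temp.users_raw
)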

Want more?

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Written by Felipe Hoffa

Developer Advocate @Google. Originally from Chile, now in San Francisco and around the world. Let’s talk data.
