Loading MySQL backup files into BigQuery — straight out of Cloud SQL

Reading MySQL backup files into BigQuery is not always easy. Now you can load these backups into a Cloud SQL MySQL instance — and then have BigQuery read straight out of MySQL.

Felipe Hoffa
Sep 9 · 6 min read

The problem

Let’s say you need to load some existing MySQL backups into BigQuery — like the MySQL dumps that GHTorrent publishes. But when you try to load them into BigQuery, you find that it’s not straightforward:

bq load --autodetect temp.users1 gs://fh/users.csv

BigQuery error in load operation: Error processing job: Error while reading data,
error message: CSV table encountered too many errors, giving up.
Failure details:
- gs://fh/users.csv: Error while reading data, error message: Error detected while parsing row starting at position: 1558192614. Error: Data between close double quote (") and field separator.

The solution: Cloud SQL as a BigQuery federated source

Now that BigQuery can read tables straight out of Cloud SQL instances (MySQL and PostgreSQL), we can just load our MySQL backups into Cloud SQL, and then:

CREATE TABLE temp.users AS
SELECT * FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT * FROM users;')
# This statement created a new table named fh-bigquery:temp.users.

Step by step

Set up a MySQL instance in Cloud SQL

It’s pretty straightforward to create a MySQL instance from the Cloud Console or with gcloud. Then create a user for BigQuery to connect with, and a database to hold your tables.
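A sketch of the whole setup with gcloud; the instance, user, and database names follow the rest of this post, while the tier, region, and password are illustrative assumptions:

```shell
# Create a small MySQL instance (tier and region here are illustrative choices).
gcloud sql instances create mycloudsql \
  --database-version=MYSQL_5_7 \
  --tier=db-n1-standard-1 \
  --region=us-central1

# Create the user that BigQuery will connect as.
gcloud sql users create bigquery \
  --instance=mycloudsql \
  --password=your-password-here

# Create the database that will hold the GHTorrent tables.
gcloud sql databases create ghtorrent --instance=mycloudsql
```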

You can connect to it:

gcloud sql connect mycloudsql --user=root --quiet

And set up your tables — in this case, taking the script out of GHTorrent’s schema.sql:

SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='TRADITIONAL,ALLOW_INVALID_DATES';
SET @OLD_TIME_ZONE=@@session.time_zone;
DROP SCHEMA IF EXISTS `ghtorrent` ;
CREATE SCHEMA IF NOT EXISTS `ghtorrent` DEFAULT CHARACTER SET utf8 ;
USE `ghtorrent` ;
-- -----------------------------------------------------
-- Table `ghtorrent`.`users`
-- -----------------------------------------------------
DROP TABLE IF EXISTS `ghtorrent`.`users` ;
CREATE TABLE IF NOT EXISTS `ghtorrent`.`users` (
`id` INT(11) NOT NULL AUTO_INCREMENT COMMENT '',
`login` VARCHAR(255) NOT NULL COMMENT '',
`company` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
`created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '',
`type` VARCHAR(255) NOT NULL DEFAULT 'USR' COMMENT '',
`fake` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`deleted` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
`long` DECIMAL(11,8) COMMENT '',
`lat` DECIMAL(10,8) COMMENT '',
`country_code` CHAR(3) COMMENT '',
`state` VARCHAR(255) COMMENT '',
`city` VARCHAR(255) COMMENT '',
`location` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
PRIMARY KEY (`id`) COMMENT '')
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;

Note the MySQL config line that says ALLOW_INVALID_DATES; it will create some trouble for us later.

Load your backups into Cloud SQL

Once you’ve created a database and the tables are set up, reading an existing file from GCS into Cloud SQL is a one-step operation:

gcloud sql import csv \
mycloudsql \
gs://fh-bigquery/ghtorrent/mysql-2019-02-01/users.csv \
--database=ghtorrent --table=users;

Connect BigQuery to Cloud SQL

In the new BigQuery web UI you can create a connection to the Cloud SQL instance we just created.

Make sure to create this connection in a matching region.
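This can also be done from the command line with bq; in this sketch, the connection id, instance path, and credentials are placeholder assumptions:

```shell
# Create a BigQuery connection pointing at the Cloud SQL instance.
# The --location must match the region of the datasets that will query it.
bq mk --connection \
  --connection_type='CLOUD_SQL' \
  --properties='{"instanceId":"fh-bigquery:us-central1:mycloudsql","database":"ghtorrent","type":"MYSQL"}' \
  --connection_credential='{"username":"bigquery","password":"your-password-here"}' \
  --project_id=fh-bigquery \
  --location=US \
  mycloudsql
```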

Now just read into BigQuery

Everything is in its place — let’s bring the data in:

CREATE TABLE temp.users AS
SELECT * FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT * FROM users;')

But then, we run into a new error message that we’ll need to deal with:

Error while reading data, error message: Failed to convert column created_at of 867-th row from MySQL to BigQuery. Error: Returns non-TIMESTAMP value: 0000-00-00 00:00:00, TIMESTAMP must be in format: 'YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]][time zone]', and range between 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999. You can use MySQL formatting functions to convert TIMESTAMP to BigQuery format.

Turns out the MySQL table has some invalid timestamps — so it won’t be as straightforward. How do we fix this?

Reading into BigQuery with some SQL magic

The query that fixes the import process:

CREATE TABLE temp.users AS
SELECT * EXCEPT(x) REPLACE(x AS created_at)
FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT *, IF(created_at="0000-00-00 00:00:00",null,created_at) x FROM users')
# Query complete (5 min 58 sec elapsed, 2.6 GB processed)

How does this work?

  • IF(created_at="0000-00-00 00:00:00",null,created_at) AS x is executed inside MySQL. When it finds an invalid timestamp, it replaces it with null.
  • SELECT * EXCEPT(x) REPLACE(x AS created_at) is executed inside BigQuery. REPLACE() replaces the faulty created_at column with its fix, and EXCEPT() makes sure we don’t import that temporary column.

Implementation notes

  • The final query took 6 minutes and processed 2.6 GB. While testing, you can add a LIMIT within the MySQL query to reduce these values substantially (3 seconds and 1 KB while I was testing).
  • Loading this table into BigQuery didn’t treat \N as null. You can fix that in the same way we fixed created_at invalid timestamps.
  • Note that we didn’t need to set up service accounts or firewall rules anywhere. This is all handled transparently between Cloud SQL and BigQuery.
  • gcloud makes it easy to load data into MySQL tables, but you still need to manually create those tables as shown above.
  • You can stop and restart your Cloud SQL instances as needed (to save costs).
  • I love that BigQuery is able to figure out the schema automatically out of a MySQL SELECT *.
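For the \N issue, a sketch of the same REPLACE() trick; here the column name company is just an illustrative assumption for one of the affected columns:

```sql
-- Turn MySQL's literal \N markers into real NULLs on the BigQuery side.
CREATE TABLE temp.users AS
SELECT * REPLACE(IF(company = '\\N', NULL, company) AS company)
FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT * FROM users;')
```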

On loading MySQL backups straight into BigQuery

Why wasn’t it straightforward to load the .CSV files produced by MySQL into BigQuery? It turns out MySQL’s default CSV export is not 100% CSV compliant, due to how it escapes quote marks inside quoted fields.
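To illustrate: standard CSV (RFC 4180) doubles a quote inside a quoted field, while MySQL’s default export escapes it with a backslash, which is what trips BigQuery’s parser:

```
RFC 4180:      "he said ""hi"""
MySQL export:  "he said \"hi\""
```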

Stack Overflow users have been asking about this for a while.

There’s an open issue requesting BigQuery support for MySQL’s default CSV export choices — vote and follow it for updates. Meanwhile, this post showed how BigQuery can read these tables straight out of MySQL.

Alternative: Parse with a UDF

You could also ingest the rows “raw” and parse the CSV inside BigQuery with a persistent UDF. See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83:

SELECT csv.cols[SAFE_OFFSET(0)] a
, csv.cols[SAFE_OFFSET(1)] b
, csv.cols[SAFE_OFFSET(2)] c
, csv.cols[SAFE_OFFSET(3)] d
, csv.cols[SAFE_OFFSET(4)] e
, csv.cols[SAFE_OFFSET(5)] f
FROM (
  SELECT fhoffa.x.parse_csv('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
)

Want more?

Imagine the possibilities: now we can have a live Cloud SQL instance replicating a remote MySQL database, keeping BigQuery up to date with your transactional data. For more on this, stay tuned.

I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Tweet me at @felipehoffa, find my previous posts on medium.com/@hoffa, and all things BigQuery at reddit.com/r/bigquery.

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

