How to serve BigQuery results from MySQL with Cloud SQL

See how it was possible to use Cloud SQL as an intermediary to serve results in Data Studio.

Given a dataset that you have on BigQuery, the script below can load any data into any schema and any table:

#!/bin/bash
BUCKET_FOLDER_URI="gs://general-bucket/table"
INSTANCE="instance_name"
DATABASE="schema_name"
TABLE="destination_table"
CSV_FILES=($(gsutil ls $BUCKET_FOLDER_URI))

for csv in "${CSV_FILES[@]}"
do
echo "Importing ${csv}"

gcloud sql import csv $INSTANCE $csv --database=$DATABASE --table=$TABLE --quiet
done

Leave the indexes disabled during the import, or add them afterwards as you need them; each case is unique, and an index over all columns is not the general case.
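
As a sketch of that approach, assuming a hypothetical secondary index named idx_lookup on two placeholder columns of the generic destination_table above (InnoDB ignores ALTER TABLE ... DISABLE KEYS, so dropping and recreating the index is the usual route):

-- Hypothetical names: idx_lookup, col_a and col_b are placeholders, not part of any real schema.
-- Drop the secondary index before a large import:
ALTER TABLE destination_table DROP INDEX idx_lookup;

-- ... run the import script above ...

-- Afterwards, recreate only the indexes your queries actually need:
CREATE INDEX idx_lookup ON destination_table (col_a, col_b);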

StackOverflow use case

My colleague Felipe Hoffa wrote a blog post about how to predict the chances of your StackOverflow question being answered, using BQML. He trained a model with approximately 40 million possibilities and stored the results in a table with the following fields:

  • minutes
  • predicted_label
  • probability_of_downvote
  • probability_of_answer
  • tag
  • first_word
  • ends_question
  • weekday_utc
  • account_creation_year
  • question_length
  • hour_utc

The next step is to create the same table on your MySQL Cloud SQL instance:

CREATE TABLE predictions (
minutes NUMERIC(18, 15) NOT NULL,
predicted_label NUMERIC(18, 16) NOT NULL,
probability_of_downvote NUMERIC(19, 17) NOT NULL,
probability_of_answer NUMERIC(19, 17) NOT NULL,
tag VARCHAR(50) NOT NULL,
first_word VARCHAR(30) NOT NULL,
ends_question VARCHAR(10) NOT NULL,
weekday_utc VARCHAR(10) NOT NULL,
account_creation_year INTEGER NOT NULL,
question_length VARCHAR(10) NOT NULL,
hour_utc INTEGER NOT NULL
);

Then export the data from BigQuery:

bq extract --destination_format CSV --field_delimiter "," --print_header=false my-project:stackoverflow.predictions gs://bqml-predictions/data/predictions-*.csv

This will generate several CSV files inside your bucket, which you can see through the GUI or with a gsutil ls command.

Cloud SQL has a GUI to import files into a specific table; however, you don't want to do that for every single file. To make things easier, I am using the script below to import those files:

#!/bin/bash
BUCKET_FOLDER_URI="gs://bqml-predictions/data"
INSTANCE="my_cloud_sql_instance"
DATABASE="my_schema_name"
TABLE="predictions"

CSV_FILES=($(gsutil ls $BUCKET_FOLDER_URI))

for csv in "${CSV_FILES[@]}"
do
echo "Importing ${csv}"

# --async can't be used on the gcloud sql command
# because it only accepts one import operation at a time

gcloud sql import csv $INSTANCE $csv --database=$DATABASE --table=$TABLE --quiet
done

Side note: you can still use LOAD DATA LOCAL INFILE for all of the generated CSVs, or even generate a single file and import it, but it is better to import small and often than large and all at once and have things break.
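
If you do go the LOAD DATA LOCAL INFILE route, a minimal sketch could look like the following; the file name is a placeholder for one of the exported shards, and the delimiter and missing header line match the bq extract flags used earlier:

-- Requires local_infile to be enabled on both the client and the server.
-- 'predictions-000000000000.csv' is a placeholder file name.
LOAD DATA LOCAL INFILE 'predictions-000000000000.csv'
INTO TABLE predictions
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';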

Consuming

The Data Studio dashboard fires an impressive number of queries to display the data, especially with all of those filters. In order to make it faster, we discovered that we needed two indexes:

CREATE INDEX all_columns ON predictions(
tag, first_word, ends_question, weekday_utc, account_creation_year, question_length, hour_utc
);
CREATE INDEX all_columns_except_tag ON predictions(
first_word, ends_question, weekday_utc, account_creation_year, question_length, hour_utc
);

If you want to see which queries are fired, I recommend temporarily setting the performance_schema flag to on for your Cloud SQL instance. Then you will be able to see where indexes are necessary for your use case. Do not leave the flag on permanently; it will impact performance.
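
With the flag on, a query like this one against MySQL's standard performance_schema tables (nothing Cloud SQL specific) gives a rough picture of which statement shapes hit the instance most often:

-- Most frequently executed statement digests since the last reset.
SELECT digest_text,
       count_star     AS executions,
       sum_timer_wait AS total_wait_ps -- picoseconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY count_star DESC
LIMIT 10;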

Monitoring

I chose an instance size where the whole dataset would fit in memory. I also enabled High Availability, and I am using Stackdriver to monitor CPU and memory, along with some custom alerts to notify me in case certain thresholds are reached.
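
A rough way to sanity-check the "fits in memory" assumption from MySQL itself is to compare the table's data plus index size against the InnoDB buffer pool; this sketch uses the hypothetical schema name from the script above:

-- Approximate size of the predictions table (data + indexes), in MB.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS total_mb
FROM information_schema.tables
WHERE table_schema = 'my_schema_name'
  AND table_name = 'predictions';

-- Compare against the buffer pool size (in bytes).
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';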

Another option is to use MySQL Workbench itself to see how things are going; here is an example:

MySQL Workbench

At this point traffic was pretty low; I expect those connection numbers, as well as the SELECT statements, to increase over time.
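
If you prefer to check those numbers from a plain MySQL client instead of Workbench, the standard status counters give the same picture:

-- Current client connections and cumulative SELECT statements since the server started.
SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected', 'Com_select');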

Takeaways

You can serve your data from BigQuery directly into Data Studio, but depending on the complexity of the queries, it may be a bit frustrating because of the number of records you are pulling information from.

In this case, performance was improved by using Cloud SQL as an intermediary for the probabilities generated in BigQuery with BQML, together with the appropriate indexes.

See an error? Please comment below!
