BigQuery for Big Data and AI

Pinku Deb Nath
Jul 6, 2019 · 4 min read

BigQuery allows us to fetch data from distributed storage using SQL queries. Pretty simple. Yet, behind the scenes, a lot of engineering goes into making SQL queries run correctly and deterministically over a distributed system. Using BigQuery, we can initially fetch a small subset of data to train our models locally, and then train the final model in the cloud using the full dataset from BigQuery.

BigQuery is a service within Google Cloud Platform, which requires a credit card to sign up. Assuming you have one, don’t worry: the charges for what we do here are minimal.

First, we need to create a Storage bucket. Choose a Regional bucket, and it is safest to pick the same region for the other services that will use it, such as Datalab, a hosted IPython notebook environment in GCP. Go to Navigation menu -> Storage -> Browser and then click Create bucket. The UI changes from time to time, so you may need to explore a bit.

Speaking of Datalab, if we want to run IPython within GCP, we can launch Cloud Datalab to query BigQuery. We could also do it locally using the BigQuery client APIs for Python, but it is simpler within Datalab because authentication and other setup are handled automatically. Besides, the Python code is the same in either environment.
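
If you do want to query from a local environment instead, a minimal sketch using the google-cloud-bigquery client library would look roughly like this (this is an assumption on my part, not part of the Datalab setup; it requires installing the package and configuring application credentials, and the project id below is a placeholder):

from google.cloud import bigquery

# Placeholder project id; replace with your own GCP project
client = bigquery.Client(project='my-project-id')
sql = "SELECT COUNT(*) AS num_rows FROM `publicdata.samples.natality`"
# Runs the query and loads the single-row result into a pandas DataFrame
df = client.query(sql).to_dataframe()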

  • Within our GCP, open the Cloud Shell terminal.
  • List the available zones using gcloud compute zones list.
  • Create a Datalab instance named labnamevm using datalab create labnamevm --zone <ZONE>.
  • To reconnect to previously created Datalab instance, datalab connect labnamevm.
  • In Cloud Shell, follow any prompts. If asked for an SSH passphrase, hit return for no passphrase.
  • Once the Datalab instance is created, open the Web Preview from the top-right corner of the Cloud Shell ribbon.
  • Create a new notebook by clicking on the icon at the top left.

Now, we will query BigQuery using Python. We will use a sample dataset provided within Google Cloud Platform. For our own data, we can first upload it to the Storage bucket we created.
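
As a side note, one way to upload a local file to the bucket from Python is the google-cloud-storage client (a sketch, not something we need for the sample dataset; the bucket and file names below are placeholders):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket-name')   # the Regional bucket created earlier (placeholder name)
blob = bucket.blob('data/my_data.csv')     # destination path inside the bucket (placeholder)
blob.upload_from_filename('my_data.csv')   # local file to upload (placeholder)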

import google.datalab.bigquery as bq

# Create SQL query using natality data after the year 2000
query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  ABS(FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING)))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

# Fetch only the first 100 rows and convert the result to a pandas DataFrame
df = bq.Query(query + " LIMIT 100").execute().result().to_dataframe()
df.head()

Here, we execute the query with the response limited to 100 records, fetch the result, and convert it to a pandas data frame. Pretty convenient!

Running the cell displays the first few rows of the resulting data frame.

To count the total number of records that would be fetched by the previous query without the LIMIT:

countQuery = "SELECT COUNT(weight_pounds) FROM (" + query + ")

Suppose we want to create a train set and a test set that are distinct. However, there can be multiple copies of what is effectively the same record, and a naive random split could distribute them across both sets. To handle this, we can compute a hash of a combination of fields and put conditions on the hash value, so that all records with the same field values always land in the same set.

query = """
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(is_male as string), CAST(YEAR AS STRING), CAST(month AS STRING)))), 10) AS hashmonth
FROM
publicdata.samples.natality
WHERE MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(is_male as string), CAST(YEAR AS STRING), CAST(month AS STRING)))), 10) >= 8
"""
df = bq.Query(query1).execute().result().to_dataframe()
df.head()

The output again shows the first few rows, this time containing only records whose hash bucket is 8 or 9.

The above query returns all the data satisfying the condition. Suppose we want only about 10% of that data to test our ideas quickly; in that case we can extend the WHERE clause with the rand() function:

WHERE MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(is_male as string), CAST(YEAR AS STRING), CAST(month AS STRING)))), 10) >= 8 and rand() < 0.1
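
Since the WHERE clause is the last part of the query string, we can append the extra condition the same way we appended LIMIT earlier (a sketch; df_sample is just an illustrative name and the 0.1 threshold is arbitrary):

# Keep roughly 10% of the rows that pass the hash filter
df_sample = bq.Query(query + " AND rand() < 0.1").execute().result().to_dataframe()
len(df_sample)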

Suppose we want to remove the records that have NaN in certain columns:

import pandas as pd
# Keep only rows where plurality is not null; .copy() avoids pandas' SettingWithCopyWarning later
df1 = df[pd.notnull(df['plurality'])].copy()

Suppose we want to map the values of a column to something else:

pluralityMapper = dict(zip([1,2,3,4,5],
['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
df1['plurality'].replace(pluralityMapper, inplace=True)

Let’s see the modified data frame:

df1.head()

We can export the data frame as a CSV file:

df1.to_csv('test.csv', index=False, header=False)

I will keep adding things as I learn. If I missed something or made an error, do let me know. Cheers!
