BigQuery API To Manage Tables + Cloud Functions = ❤️

Rob Salgado
Mar 16 · 4 min read

We have a BI platform that shall remain nameless. The BI platform in question creates temporary tables in BigQuery, which is fine, but the issue is that they are not always deleted after they’re used. As someone who favors order over chaos, I thought it would be a good idea to get rid of them automatically myself.

Using the BigQuery API you can delete these temp tables, and you can automate the process with a Cloud Function that runs on a schedule. Pretty neat. Let’s get to it. As always, the full code can be found here. Stick around for the bonus at the end.

The first step is to query the API to get the table summary info for the dataset:

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()

project_id = '<YOUR PROJECT ID>'
dataset_id = '<YOUR DATASET ID>'

sql = """
SELECT * FROM `<YOUR DATASET ID>.__TABLES_SUMMARY__`
"""

Replace anything in <> with your info. Note that since this will live in a Cloud Function on GCP, we don’t need to provide credentials. Throw that query right into a dataframe and voila:

#Running the query and putting the results directly into a df
df = client.query(sql).to_dataframe()

We now have the name of each table in the dataset as well as the creation time and the type.

The temp tables always have the same prefix, so it will be easy to identify them. We don’t want to delete any temp tables that are less than a day old since they may still be in use, so we will use the creation time for that. The creation time is a timestamp (in milliseconds) and will reflect the timezone of your project. You can convert it to a date like so:

df['date'] = pd.to_datetime(df['creation_time'], unit='ms').dt.strftime('%Y-%m-%d')

Next we can grab today’s date to get yesterday’s date, which we will use to filter out the tables that are less than a day old.

Since we want to automate this with a Cloud Function, we first have to change the timezone from UTC to Eastern, because Cloud Functions use UTC by default.

import datetime
from dateutil.relativedelta import relativedelta
from dateutil import tz
#Change the timezone from UTC to Eastern
from_zone = tz.gettz('UTC')
to_zone = tz.gettz('America/New_York')
utc = datetime.datetime.utcnow()
utc = utc.replace(tzinfo=from_zone)
est = utc.astimezone(to_zone)
#Use today's date to get yesterday's
today = est.date()
yest = today - relativedelta(days=1)

OK now that we have yesterday’s date and we are working in the same timezone, we can filter the df to only include the temp tables that are more than 1 day old:

#Getting all the temp tables
df_del = df.loc[df['table_id'].str.contains('^temp_')]
#Getting all the ones that are more than 1 day old
df_del = df_del.loc[df_del['date'] <= pd.to_datetime(yest).strftime('%Y-%m-%d')]

Now that we have the temp tables we want to delete, the next step is to call the API again and delete them. You can only delete one table at a time, and since there may be more than one table per day, we will put them in a list and loop through it like so:

#Into a list they go
temp_to_del = df_del['table_id'].tolist()

#Looping through the list of temp tables and deleting each one
for each in temp_to_del:
    table_ref = client.dataset(dataset_id).table(each)
    client.delete_table(table_ref)

And that’s it. Now just throw all of that into a Cloud Function and schedule it to run every day.
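For reference, here’s a rough sketch of what that could look like once it’s wrapped up as a Pub/Sub-triggered function wired to a daily Cloud Scheduler job. This is just one way to do it, condensed to a simple 24-hour UTC cutoff instead of the Eastern-date comparison above; the function name, topic, and schedule below are placeholders of mine, so adjust them to your setup.

#main.py: a condensed sketch of the steps above as a Cloud Function entry point
import datetime
import pandas as pd
from google.cloud import bigquery

DATASET_ID = '<YOUR DATASET ID>'

def clean_temp_tables(event, context):
    #Pub/Sub-triggered entry point: delete temp_ tables more than a day old
    client = bigquery.Client()
    sql = 'SELECT * FROM `{}.__TABLES_SUMMARY__`'.format(DATASET_ID)
    df = client.query(sql).to_dataframe()
    #creation_time is in milliseconds
    created = pd.to_datetime(df['creation_time'], unit='ms')
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=1)
    old_temp = df.loc[df['table_id'].str.startswith('temp_') & (created < cutoff)]
    for table_id in old_temp['table_id']:
        client.delete_table(client.dataset(DATASET_ID).table(table_id))

#Deploying and scheduling it (shell commands, shown here as comments):
#gcloud functions deploy clean_temp_tables --runtime python37 --trigger-topic temp-cleanup
#gcloud scheduler jobs create pubsub temp-cleanup-daily --schedule "0 6 * * *" --topic temp-cleanup --message-body go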

BONUS TIME

If you’re like me, you are using Cloud Datalab sitting on a huge virtual machine to analyze large amounts of data that mostly live in BQ. Therefore you need a good way to get that data into Datalab.

One of the better ways to get a lot of data out of BQ and into Datalab is to query it into a BQ table and then use the API to dump the data from that table into a CSV on Google Cloud Storage. It’s then easy to get that CSV into Datalab.

#EXPORT BQ TABLE TO GCS AS CSV
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file('<PATH TO CREDENTIALS JSON FILE>')

bucket_name = '<YOUR BUCKET NAME>'
project = '<YOUR PROJECT NAME>'
dataset_id = '<YOUR DATASET ID>'
table_id = '<YOUR TABLE ID>'

client = bigquery.Client(credentials=credentials, project=project)

#Note you have to specify a file prefix since it will be in multiple files
destination_uri = 'gs://{}/{}'.format(bucket_name, 'file_prefix*.csv')

dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(table_ref,
                                   destination_uri,
                                   location='US')
extract_job.result()

print('Exported {}:{}.{} to {}'.format(project, dataset_id, table_id, destination_uri))

Then once you’re in Datalab, the following will download all the files with that prefix and put them right into a df.

import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

#Load a series of csv files with the same prefix
myBucket = storage.Bucket('<YOUR BUCKET>')
object_list = myBucket.objects(prefix='prefix')

df_list = []
for object in object_list:
    %gcs read --object $object.uri --variable data
    df_list.append(pd.read_csv(BytesIO(data)))

df = pd.concat(df_list)
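As an aside, if you’d rather not rely on the %gcs cell magic, roughly the same thing can be done in plain Python with the google-cloud-storage client. This is just a sketch and assumes that package is available in your Datalab environment; the bucket name and prefix below are placeholders.

#Alternative sketch: read the csv shards with the google-cloud-storage client
from io import BytesIO
import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('<YOUR BUCKET NAME>')

df_list = []
for blob in bucket.list_blobs(prefix='file_prefix'):
    #Download each csv shard into memory and parse it with pandas
    df_list.append(pd.read_csv(BytesIO(blob.download_as_string())))

df = pd.concat(df_list, ignore_index=True)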
