Update BigQuery columns description from Google Spreadsheet
At my company, we heavily rely on Google Cloud. Specifically for datawarehousing we use BigQuery. With thousands of tables in it, we need to do a good job in keeping metadata tidy and clean.
In order to do this, our product data team has different Google spreadsheets so they can manage the column description for every field of every table. Product don’t have direct access or permissions to edit directly on BigQuery, so they do it in a spreadsheet like this one:
With this spreadsheet, we have processes working in the background that update BigQuery columns description with the data we have in the spreadsheets. These processes use many python libraries Google provides in order to communicate with their products and some other open source libraries as well.
First, we need some methods to talk to the Google spreadsheet and retrieve the fields and their descriptions. We use the library gspread.
import gspread
from oauth2client.service_account import ServiceAccountCredentialsclass GoogleSheetHandler():
def __init__(self):
self.gc = gspread.authorize(ServiceAccountCredentials.from_json_keyfile_name("c://PATH_TO_YOUR_SECRET/credentials.json", ['https://spreadsheets.google.com/feeds',
'https://www.googleapis.com/auth/drive'])) def get_spread_sheet(self,spread_sheet_id,worksheet):
self.spread_sheet = self.gc.open_by_key(spread_sheet_id)
self.worksheet = self.spread_sheet.worksheet(worksheet) def get_all_values_by_col(self,col_id):
return self.worksheet.col_values(col_id)
Make sure you give read permissions in the spreadsheet to the user that is on your credentials.json.
Then we have two helper functions and a main one to update the fields. The tricky and fun part here is that this main function is recursive, so it supports any level of nested Record types BigQuery provides.
We have a BigQuery table like this:
and calling this function:
update_description_columns_bqtable_from_gsheet(project, dataset,table_name,"YOUR_SPREADSHEET_ID","YOUR_WORKSHEEET_NAME",1,4)
then we get this:
Please make sure your fields in the spreadsheet are with the same case as of the columns in BigQuery. While writing this article, I had done the mistake of putting different cases, and of course it didn’t work :)
Happy bigquerying :)