How to load Google Shared Drive Files to Google BigQuery?
Hello,
In this article, I will talk about the Python code we wrote for a request we received as the Trendyol DWH team on our Google BigQuery journey.
First, let’s start with the request itself: a customer would place CSV files on a Google Shared Drive at no fixed day or time, and wanted them loaded automatically into a table for reporting. The uploaded files would follow a standard naming convention, and each file name would map to its own table.
For example:
The file onur_20200912.csv will be loaded into the table dwh.temp.onur, and every subsequent onur_YYYYMMDD.csv file will be loaded into the same dwh.temp.onur table.
Previously, we could already expose Google Drive files in BigQuery as external tables thanks to BigQuery’s native feature. However, this request was different from the existing setup: we wouldn’t know exactly when the incoming files would arrive, and we didn’t want to go into the BigQuery UI every time and take manual action. So we decided to write code that could check Google Drive whenever we wanted and load the file into the table if one was there.
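For reference, that native setup looks roughly like the following with the BigQuery Python client. This is only a sketch: the table name and Drive file URL are placeholders, and the client’s credentials must carry the Drive scope in addition to BigQuery.

```python
from google.cloud import bigquery

# Sketch: define an external table that reads a CSV file directly from Google Drive.
client = bigquery.Client()  # credentials must include the Drive scope as well as BigQuery

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["https://drive.google.com/open?id=<FILE_ID>"]  # placeholder URL
external_config.autodetect = True

table = bigquery.Table("dwh.temp.onur_external")  # placeholder table name
table.external_data_configuration = external_config
client.create_table(table)  # queries against this table read the Drive file at query time
```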
After researching how to do it, we designed a flow consisting of the following steps.
- Keep a YAML properties file and have the Python code read its parameters from this file.
- Check the table in BigQuery for data belonging to that day. If there is data in the table, terminate the code successfully. If there is no data, check whether that day’s file has arrived in Google Drive; if it has, load it into the table.
- Download the file with the Google Drive API to the machine where the code is running. (The BigQuery Python library cannot load data into a table directly from a Drive URL. I wish it could. :/)
- Load the downloaded file into BigQuery.
- Delete the downloaded file from the machine.
After laying out the steps, the code we wrote was as follows. :)
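Below is a minimal sketch of that flow rather than our exact production code. The function names match the ones explained next, but the YAML key drive_folder_id, the assumption that the target table has a load_date column stored as a YYYYMMDD string, the /tmp download path, and the service account paths are placeholders and simplifications for illustration.

```python
import io
import os
import sys
from datetime import datetime

import yaml
from google.cloud import bigquery
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload


def control_table_for_new_file(load_table, file_name, file_type, load_date):
    """Return True if the target table already has rows for the given day."""
    client = bigquery.Client()
    print(f"Checking {load_table} for {file_name}_{load_date}.{file_type}")
    # Assumes the table has a load_date column stored as a YYYYMMDD string.
    query = f"SELECT COUNT(1) AS cnt FROM `{load_table}` WHERE load_date = '{load_date}'"
    rows = list(client.query(query).result())
    return rows[0].cnt > 0


def get_file_from_drive(drive_credentials_file, folder_id, file_name, file_type, load_date):
    """Find <file_name>_<load_date>.<file_type> in the shared Drive folder and download it."""
    credentials = service_account.Credentials.from_service_account_file(
        drive_credentials_file,
        scopes=["https://www.googleapis.com/auth/drive.readonly"],
    )
    drive = build("drive", "v3", credentials=credentials)
    target_name = f"{file_name}_{load_date}.{file_type}"
    response = drive.files().list(
        q=f"name = '{target_name}' and '{folder_id}' in parents",
        fields="files(id, name)",
        supportsAllDrives=True,
        includeItemsFromAllDrives=True,
    ).execute()
    files = response.get("files", [])
    if not files:
        return None  # the file has not arrived yet
    request = drive.files().get_media(fileId=files[0]["id"])
    local_path = f"/tmp/{target_name}"
    with io.FileIO(local_path, "wb") as fh:
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()
    return local_path


def load_file_to_bigquery(load_table, local_path):
    """Load the downloaded CSV into BigQuery, then delete the local copy."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    with open(local_path, "rb") as f:
        job = client.load_table_from_file(f, load_table, job_config=job_config)
    job.result()  # wait for the load job to finish
    os.remove(local_path)


if __name__ == "__main__":
    # Path of the service account JSON created for Google Drive (placeholder).
    drive_credentials_file = "/path/to/drive_service_account.json"
    # Path of the service account JSON created for BigQuery, via an environment variable (placeholder).
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/bigquery_service_account.json"

    # First input parameter: the YAML properties file.
    with open(sys.argv[1]) as f:
        props = yaml.safe_load(f)

    load_table = props["load_table"]
    file_name = props["file_name"]
    file_type = props["file_type"]
    folder_id = props["drive_folder_id"]

    # Optional second input parameter: the date in YYYYMMDD format; defaults to today.
    load_date = sys.argv[2] if len(sys.argv) > 2 else datetime.today().strftime("%Y%m%d")

    if not control_table_for_new_file(load_table, file_name, file_type, load_date):
        local_path = get_file_from_drive(drive_credentials_file, folder_id, file_name, file_type, load_date)
        if local_path:
            load_file_to_bigquery(load_table, local_path)
```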
Let’s explain the code step by step:
- In the control_table_for_new_file function, we use the variables load_table, file_name, and file_type we read from the YAML file. We connect to BigQuery and check whether there is data in the table for that day. If there is, the code ends successfully; if not, the flow continues.
- In the get_file_from_drive function, we access Google Drive through the Google Drive API with the variables we read from the YAML file, find the file, and download it.
- In the load_file_to_bigquery function, we load the downloaded file into BigQuery with the variables we read from the YAML file. Once the load succeeds, we delete the file from local disk.
In the main part of the code:
- We set the path of the service account JSON file we created for Google Drive as a variable.
- We provide the path of the service account JSON file we created for Google BigQuery as an environment variable.
- We parse the YAML file passed as the first input parameter into the variables we need.
- Optionally, we take the date parameter (in YYYYMMDD format) as the second input parameter; otherwise we default to the current day.
- Finally, we call the functions with the related variables.
I am also sharing below a sample of the YAML format required for the code to run.
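A properties file for the sketch above might look like this; the key names match that sketch and are assumptions rather than the exact keys we used, and the folder ID is a placeholder.

```yaml
# Hypothetical example; key names and values are for illustration only.
load_table: dwh.temp.onur                   # target BigQuery table
file_name: onur                             # file name prefix before _YYYYMMDD
file_type: csv                              # file extension
drive_folder_id: <shared-drive-folder-id>   # Google Drive folder to watch
```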
I am also sharing my Conda environment YAML below.
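A representative environment for the dependencies used above could look like the following; the environment name and package versions are illustrative, not the exact ones we pinned.

```yaml
name: drive-to-bq
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pyyaml
  - google-cloud-bigquery
  - google-api-python-client
  - google-auth
```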
Happy coding!