Update 1

Oladayo
3 min read · Oct 29, 2022


Earlier this year, I worked on this project: https://medium.com/codex/building-an-oilfield-production-surveillance-dashboard-1629865e2ec9. While I was super happy that I worked on a project from start to finish, I was always conscious of some limitations of the approach I took.

Architecture diagram of my previous approach
  1. Every time the Python script (the data wrangling.py file) runs as a result of a new file being added to the first bucket (in Google Cloud Storage), it reads all the files in the bucket all over again.

This means that the Python script’s run time keeps increasing with every subsequent run; at some point, the script might fail to run.

Note: Google Cloud Functions has a maximum timeout duration of 540 seconds for event-driven functions.

2. Storing the transformed data as an Excel file in a second bucket (in Google Cloud Storage) means the transformed data cannot be queried for analysis.

What did I do differently this time?

Architecture diagram of my new approach

The only logical approach to address the first limitation is to make the Python script read just the latest (new) file added to the bucket rather than reading all the files in the bucket.

Every file has metadata such as the file’s name, time created, and time modified, among others. If a file has not been modified since it was created, its time created and time modified are the same.

The latest file in a bucket (or folder, as the case may be) is the one with the most recent time modified.

To do this, I obtained the blob.updated attribute (the time modified of each file in the bucket; read more here: https://cloud.google.com/storage/docs/json_api/v1/objects#:~:text=The%20modification%20time,Object%20Lifecycle%20Configuration.) of each file and collected them into a list.

The next step was finding the latest time modified by applying the maximum (max) function to the list, and then finding the file name that corresponds to that time modified.

The file name can then be used to read the file into a DataFrame.

Snapshot of the code to find the latest file in a GCS bucket
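
A minimal sketch of that logic, assuming the google-cloud-storage client library, pandas with gcsfs installed for reading gs:// paths, and Excel source files (all of which are assumptions here, and the bucket name is a placeholder), might look like this:

```python
import pandas as pd
from google.cloud import storage


def read_latest_file(bucket_name: str) -> pd.DataFrame:
    """Read only the most recently modified file in a GCS bucket into a DataFrame."""
    client = storage.Client()
    blobs = list(client.list_blobs(bucket_name))

    # blob.updated is the time modified of each file in the bucket
    updated_times = [blob.updated for blob in blobs]

    # The latest file is the one whose time modified equals the maximum of the list
    latest_blob = blobs[updated_times.index(max(updated_times))]

    # Read just that file into a DataFrame (assumes Excel files; needs gcsfs)
    return pd.read_excel(f"gs://{bucket_name}/{latest_blob.name}")
```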

There is a column (Cumulative Production) that is calculated in the script; since the script now reads only the latest file, that calculation came out incorrect, so I had to use conditional logic to fix it.
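
The exact conditional logic isn’t spelled out here, but one possible sketch, assuming an Oil_Production column and a destination BigQuery table that already holds the earlier rows (the column and table names are placeholders), is to seed the running total with what is already stored:

```python
from google.cloud import bigquery


def add_cumulative_production(df, table_id: str):
    """Compute Cumulative Production when only the latest file is read.

    Column and table names here are placeholders, not the script's actual names.
    """
    client = bigquery.Client()
    rows = client.query(
        f"SELECT MAX(Cumulative_Production) AS last_cum FROM `{table_id}`"
    ).result()
    last_cum = next(iter(rows)).last_cum

    # Conditional logic: on the very first run the table is empty (MAX is NULL),
    # so start from zero; otherwise continue from the stored running total
    offset = 0 if last_cum is None else last_cum
    df["Cumulative_Production"] = df["Oil_Production"].cumsum() + offset
    return df
```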

Finally, I defined the schema (the name of each column and its data type) and stored the DataFrame in a table in BigQuery instead of as an Excel file in a Google Cloud Storage bucket, as in my previous approach.
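
A sketch of that step with the BigQuery Python client, using load_table_from_dataframe and an explicit schema (the field names and types below are placeholders, not the script’s actual schema), might look like this:

```python
import pandas as pd
from google.cloud import bigquery


def load_to_bigquery(df: pd.DataFrame, table_id: str) -> None:
    """Append the transformed DataFrame to a BigQuery table with an explicit schema."""
    client = bigquery.Client()

    # Placeholder schema: the real script defines one field per DataFrame column
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("Well", "STRING"),
            bigquery.SchemaField("Date", "DATE"),
            bigquery.SchemaField("Oil_Production", "FLOAT"),
            bigquery.SchemaField("Cumulative_Production", "FLOAT"),
        ],
        write_disposition="WRITE_APPEND",  # append each new file's rows to the table
    )

    # Load the DataFrame into the table instead of writing an Excel file to GCS
    client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
```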

To address the second limitation, I created a table in BigQuery (a data warehouse for storing structured data, with an in-built query engine) and defined a schema that matches the schema defined in the Python script.
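
Creating that table up front could look something like the following, again with placeholder field names and table ID that would have to match the schema defined in the script:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Schema must match the one defined in the Python script
schema = [
    bigquery.SchemaField("Well", "STRING"),
    bigquery.SchemaField("Date", "DATE"),
    bigquery.SchemaField("Oil_Production", "FLOAT"),
    bigquery.SchemaField("Cumulative_Production", "FLOAT"),
]

table = bigquery.Table("project.dataset.production_table", schema=schema)
client.create_table(table, exists_ok=True)  # no-op if the table already exists
```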

The DataFrame generated from the Python script is then stored in the table.

Snapshot of the data in the BigQuery table

Limitations

Well, this approach has its limitations too:

  1. If two files are uploaded to the bucket at once, only one of them will be read.
  2. If changes are made to a file that has previously been read, uploading the changed file (which is now the latest file) to the bucket will lead to incorrect data in the calculated column (Cumulative Production).

I have an approach to fix limitation 2: use conditional logic to ensure the data from the changed file is not appended to the table in BigQuery (pass, do nothing with it), and instead update the BigQuery table with the SQL UPDATE command using the new data in the changed file.
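
A rough sketch of that idea, with placeholder table name, column names, and values, could run the UPDATE through the BigQuery client like this:

```python
import datetime

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder UPDATE: overwrite the stored rows for the changed file's well/date
# instead of appending them a second time
update_sql = """
UPDATE `project.dataset.production_table`
SET Oil_Production = @oil
WHERE Well = @well AND Date = @date
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("oil", "FLOAT64", 950.0),
        bigquery.ScalarQueryParameter("well", "STRING", "Well-1"),
        bigquery.ScalarQueryParameter("date", "DATE", datetime.date(2022, 10, 1)),
    ]
)
client.query(update_sql, job_config=job_config).result()
```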

So what’s next?

I intend to split the Python script into tasks, define the dependencies between the tasks, and use Airflow (a data workflow orchestration tool) to execute, schedule, and monitor the workflow.
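
As a rough sketch of that direction (the DAG id, task boundaries, and schedule below are placeholders, not the final design):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # read the latest file from the GCS bucket


def transform():
    ...  # wrangle the data and compute Cumulative Production


def load():
    ...  # write the DataFrame to the BigQuery table


with DAG(
    dag_id="production_surveillance",
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # Define the dependencies between tasks
    t1 >> t2 >> t3
```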

Thank you for reading.
