Building a data analytics dashboard on GCP — Part I

Andi Partovi
10 min read · Mar 17, 2022


Recently I came across a cool business problem: visualising medication supply chain disruptions, so that medical doctors and pharma companies can view them, make decisions, and find substitutes. This was around the time I was refreshing my Google Cloud Platform (GCP) knowledge, and it seemed like the universe wanted me to build this tool using GCP services!

This dashboard is live and you can view it at: www.medalarm.info

High level business objective

Like other projects in my startup, I approached this iteratively, from a product management perspective: I thought about the different versions that could achieve the bare minimum requirements of the project and gradually built on that. I documented this whole process from both the product management and the technical build perspective, which became this series of hands-on tutorials on GCP. The series does not assume any prior knowledge of cloud or GCP, and it is objective-focused, meaning that if something is not relevant to the project objective, it is only brushed over or not mentioned at all.

In Part I, I will talk about the business problem we are solving and what the first version looks like. From the build perspective, I present a tutorial on Google Cloud Storage, BigQuery, and Data Studio as I build the static version of the solution.

In Part II, I will automate all the steps in the Version 1 solution and make the dashboard update automatically. For that, I will go into more detail about Google Cloud Functions, the BigQuery Python API, and Cloud Scheduler.

In Part III, I will improve the architecture by adding Dataflow and Cloud Composer, and redo the visualisation dashboard in Python.

Business problem

The COVID interruptions to the global supply chain have dire implications for medication shortages. Medical doctors and pharmaceutical companies need to know which medications are in shortage, and for how long, to be able to make decisions like prescribing alternatives or changing clinical trial parameters.

Australia's TGA (Therapeutic Goods Administration) publishes the list of medication shortages in Australia on their website, but the information is presented as a long table with no visualisation. We want to build an interactive BI dashboard that is fed TGA's data but lets users explore the supply chain issues through nicer graphics.

Snapshot of TGA’s website — https://apps.tga.gov.au/prod/MSI/search

Solutions

The core business requirement is visualising the data in an online dashboard, so every version of our product will fulfill this objective. Each version is also fully functional and shippable.

MVP Version 1 — Offline dashboard

V1.0: A one-off static version of the dashboard that gets the data from TGA once and presents it in a dashboard.

Components:

  1. A Python script that reads the RSS feed, parses the website, and outputs the data in CSV format
  2. Manually upload the CSV into Google BigQuery
  3. Run analytics queries in BigQuery
  4. Build a dashboard in Google Data Studio connected to BigQuery
  5. Publish the dashboard on a static website

This architecture does not follow best practice…yet. This will be fixed as we add new components in the next parts.

Step 1 — Build the RSS Feeder with Python

The Python code below retrieves the information we need from the TGA website. We first access the RSS feed, which returns an XML file. The XML includes the title of the medication, the cause of the supply chain disruption, and a link to the page where the disruption dates can be found, so we also need to look for the dates in the underlying HTML of each medication's page. The Python code uses these libraries:

  • requests handles the HTTP requests
  • BeautifulSoup handles the extraction and parsing of the RSS feed (XML file) and the medication pages
  • csv helps save the list to a CSV file
  • re (regex) converts the date format
  • datetime transforms date strings into datetime objects
  • pandas creates a DataFrame from a dict and saves it to a CSV file
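
A minimal sketch of this scraper is below. The feed URL, XML tag names, and the date format on the detail pages are assumptions, so check them against the live TGA site before relying on it:

import csv
import re
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup  # the "xml" parser below also requires lxml

# Assumed feed location; verify against https://apps.tga.gov.au/prod/MSI/search
FEED_URL = "https://apps.tga.gov.au/prod/MSI/search/rss"


def parse_date(text):
    """Pull the first date like '17 March 2022' out of a page and return it as YYYY-MM-DD."""
    match = re.search(r"\d{1,2} \w+ \d{4}", text)
    if not match:
        return None
    return datetime.strptime(match.group(), "%d %B %Y").strftime("%Y-%m-%d")


def fetch_items():
    feed = requests.get(FEED_URL, timeout=30)
    soup = BeautifulSoup(feed.content, "xml")
    rows = []
    for item in soup.find_all("item"):
        link = item.link.get_text(strip=True)
        # The disruption dates live on the linked detail page, not in the feed itself.
        detail = requests.get(link, timeout=30)
        detail_text = BeautifulSoup(detail.content, "html.parser").get_text(" ", strip=True)
        rows.append({
            "medication": item.title.get_text(strip=True),
            "description": item.description.get_text(strip=True),
            "link": link,
            # Simplified: the real page lists both a start date and an expected return date.
            "shortage_start": parse_date(detail_text),
        })
    return rows


if __name__ == "__main__":
    df = pd.DataFrame(fetch_items())
    df.to_csv("tga_shortages.csv", index=False, quoting=csv.QUOTE_ALL)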

Step 2 — Enter BigQuery!

BigQuery is the crown jewel of GCP. It is GCP's enterprise data warehouse solution, built to handle analytics over petabytes of data. We are, however, using it for the very small dataset we made in the previous step.

BigQuery is a typical OLAP data warehouse, designed for fast analytics over massive amounts of data. It can ingest data from multiple sources, store it in a columnar, distributed storage system, and run SQL queries over it. All of this is managed by GCP, meaning there is no need to provision storage or compute.

For this version of the product, we will manually upload the CSV file. In the next version, we will automate this step using the BigQuery Python API. Also for this version, we are not following the best practice of keeping a copy of our raw data in a storage bucket and ingesting from there; we are going straight into BigQuery. We will improve this architecture as we build on the solution in the next part.

To upload the file, first make a new dataset, then add a table. This can be done in the GUI or using Cloud Shell. Cloud Shell is an amazing tool that gives you quick access to a VM where you can run command-line (CLI) tools; it is the equivalent of spinning up a Compute Engine VM and decommissioning it once you are done.

How to access and activate Cloud Shell
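# Create a new BigQuery dataset for the TGA data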
bq mk TGA_medalert_dataset

You can also do this in the GUI:

Creating a dataset using GUI

Then create a table in the dataset

Using the GUI: click on the dataset option menu and choose create table

Choose the upload option and set the table name
Setting the table schema

The schema tells BigQuery what each column of the data contains. Setting the schema manually often works better than letting BigQuery figure it out automatically, although auto-detection is also an option. Note that the date format has to be YYYY-MM-DD for BigQuery to parse the input as a DATE.
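
If the scraped dates are not already in that format, a quick pandas pass can normalise them before the upload (the file and column names follow the scraper sketch in Step 1):

import pandas as pd

# Rewrite the date column as YYYY-MM-DD so BigQuery can load it as a DATE.
df = pd.read_csv("tga_shortages.csv")
df["shortage_start"] = pd.to_datetime(df["shortage_start"]).dt.strftime("%Y-%m-%d")
df.to_csv("tga_shortages.csv", index=False)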

Because our CSV has a header row, under the Advanced options we set "Header rows to skip" to 1. Generally speaking, partitioning and clustering the table is critical for good performance in a big data setting (you can learn all about it here), but for our small dataset it is not critical.

Creating the table using GUI

You can preview the table and run queries once it is created

Table preview

Let's run a very simple query to see the different descriptions and how many of each we have. Note that BigQuery charges based on the amount of data processed; you get a preview of this in the top-right corner before running the query.

Query preview
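
If you want to sanity-check that estimate, or run the same query programmatically, here is a rough sketch using the BigQuery Python client (covered properly in Part II); the table and column names are assumptions based on the steps above:

from google.cloud import bigquery

client = bigquery.Client()

# Table and column names are assumptions; substitute the ones you created above.
sql = """
    SELECT description, COUNT(*) AS n
    FROM `TGA_medalert_dataset.tga_shortages`
    GROUP BY description
    ORDER BY n DESC
"""

# A dry run estimates the bytes the query would scan without actually running it.
dry_run = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
print(f"This query would process {dry_run.total_bytes_processed} bytes")

# Run the query for real and print the results.
for row in client.query(sql).result():
    print(row.description, row.n)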

Now let’s select all of the data and go to Data Studio for visualisation

Run the query, then go to Data Studio

Step 3 — Dashboarding with Data Studio

So far, we have scraped the data from TGA's website, put it into a structured CSV format, and uploaded it into a BigQuery table. Now we want to visualise the data. For this version of the product, I will do a very simple visualisation rather than the final polished product. But remember that each version has to be shippable, so we need to build a timeline, even if it is not pretty.

There are a handful of good Business Intelligence (BI) dashboarding tools on the market. You might have worked with Tableau, which is the most mature in functionality (and also the most expensive). GCP's Data Studio is free and very easy to pick up: the options are easy to figure out, and there is a growing catalogue of community-built visualisations to choose from as well. It does not, however, offer the same depth as Tableau. So my rule of thumb is to start with the free Data Studio tool and, if you hit barriers with your visualisation, seek alternatives. We are going to look at visualisation with Dash in later parts of this tutorial for a very customised look.

Building a crude timeline with a line chart

Although this is a very basic visualisation, it will do for the first version. Hit "Save" and then "Share".

Share the explorer

That takes us from the "explorer view" to the "report view", where I can add more elements to make the report look better. Here I added a few text boxes with links, and also another diagram.

Report view

Click on the "Share" drop-down menu and choose "Embed report", since we want to embed this into an HTML webpage. Alternatively, we could just share the report link, but that won't look as nice (especially in future versions).

Getting an embedding link

You will get an iframe to embed in your website

Getting an embedding link

Also make sure that the link is publicly available: click on "Share" again, go to the "Manage access" tab, and choose the "Anyone on the Internet can find and view" option.

Making the report public

Step 4 — Static webpage with Google Cloud Storage

Let's finish up by building an HTML webpage and hosting it on GCP. Since this version of the product is only a static website, we can use a Google Cloud Storage bucket for a quick deploy. For this step, we are largely following the tutorial here.

Our index.html file will only embed the report link (I have changed the sizes here):

<!DOCTYPE html>
<html>
<body>
<iframe width="1200" height="900" src="https://datastudio.google.com/embed/reporting/cbd5b1ef-c0e5-4786-aa6b-2c26dd6b4853/page/XRHoC" frameborder="0" style="border:0" allowfullscreen></iframe>
</body>
</html>

Now let's host this on GCS. First, choose the domain you are hosting this on; for us it is www.medalarm.info. I used Google Domains to buy this, but any domain registrar will work the same way.

Create a GCS bucket named after your domain, so for us it has to be "medalarm.info" (not "www.medalarm.info"). Go to Cloud Storage and create a bucket. You can also use the CLI. In terms of options:

  • Multi-region gives the best performance, but it is also the most expensive option, as it hosts the bucket in multiple regions. Here I am using a single region in Melbourne, which is closest to me.
  • Choose "Standard" storage, as we need to access this bucket often. The other storage classes are discounted for buckets that are accessed infrequently.
Creating a bucket
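
If you would rather script the bucket creation than click through the console, a rough sketch with the google-cloud-storage client might look like this (the bucket name and region follow the choices above; treat it as a sketch, not the exact setup used here):

from google.cloud import storage

client = storage.Client()

# Single-region bucket in Melbourne (australia-southeast2) with the Standard storage class.
# Note: buckets named after a domain require verifying ownership of that domain first.
bucket = client.bucket("medalarm.info")
bucket.storage_class = "STANDARD"
client.create_bucket(bucket, location="australia-southeast2")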

Upload the index.html file into the bucket

Uploading the index.html file

Since we need the Internet to access this bucket, we need to add a new permission to it. Grant "allUsers" the "Storage Object Viewer" role; this will make the bucket public to the entire Internet.

Making the storage bucket public
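
The same upload and permission steps, scripted as a hedged sketch with the client library (bucket and file names from above):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("medalarm.info")

# Upload the page we built earlier.
bucket.blob("index.html").upload_from_filename("index.html")

# Grant allUsers the Storage Object Viewer role, making the objects publicly readable.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({"role": "roles/storage.objectViewer", "members": {"allUsers"}})
bucket.set_iam_policy(policy)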

Now let's direct the traffic for our domain here. For this, we need to add a CNAME DNS record with our domain manager, so I added it on Google Domains; for GCS static hosting, the record typically maps the "www" host to "c.storage.googleapis.com.". Like any DNS record change, allow an hour or so for it to propagate over the internet. You can always check your routing here.

One last step is to add main page and 404 page pointers to the bucket, so that when a user accesses those addresses, GCS knows where to direct the traffic. We did not design a 404 page for our website, but we can point the main page to index.html.

Editing the website config
Editing the website config
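
This step, too, can be scripted; the sketch below uses the same client library to set only the main page, since there is no custom 404 page:

from google.cloud import storage

client = storage.Client()

# Point the bucket's main page at index.html; we did not build a custom 404 page.
bucket = client.get_bucket("medalarm.info")
bucket.configure_website(main_page_suffix="index.html")
bucket.patch()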

Troubleshooting this whole step can be tricky; you can refer to this list if you run into problems.

First version is ready!

Our hosted report — V1

We are done with the first version of our product! It satisfies the core customer needs, but there are lots of improvements we can make: the process is still very manual, the dashboard does not update automatically, and our timeline…well, is not really a timeline. I will fix all of these problems in the next versions.

You can continue to Part II, where we will add dynamic updates and make the whole process automatic.


Andi Partovi

Andi is an entrepreneur and data scientist. He co-founded KeyLead Health, a cloud-based medical data analytics company, in 2019.