Building a data analytics dashboard on GCP — Part II

Andi Partovi
7 min read · Mar 23, 2022

In part I, I talked about the business problem we are trying to solve and built the first version of the product to solve it. That design had a few flaws: some processes were manual, and some non-functional requirements were missing. In each subsequent part of this blog, we will build on top of our first solution and make it better. Each version will be a shippable product that satisfies the core requirements.

In this version, we will use Cloud Scheduler, the BigQuery Python API, and Google Cloud Functions to automate the manual components of our solution and make the dashboard dynamic with automatic updates.

MVP Version 2 — Dynamic dashboard

This version automates running the RSS feeder code and writing the data back to BigQuery through the API. Items 2 and 3 below are the additions to the previous version:

  1. A Python script that reads the RSS feed, parses the website, and outputs the data in CSV format
  2. Writing the data back to BigQuery using the Python API
  3. Running the RSS feeder and the data write-back automatically on a schedule, using Cloud Scheduler and Google Cloud Functions
  4. Running analytics queries in BigQuery
  5. Building a dashboard in Google Data Studio connected to BigQuery
  6. Publishing the dashboard on a static website
Architecture diagram V2

Step 1 — Add the BigQuery API to the Python code

Let's modify our Python code from part I so that it writes the scraped data to BigQuery automatically through the API. We will:

  • Add the necessary BigQuery libraries: pandas.io, pandas_gbq, and gcsfs.
    gcsfs enables connections to Cloud Storage. pandas.io and pandas_gbq enable interactions with Google Cloud and BigQuery. pandas_gbq has two very handy functions:
    - to_gbq: writes data to a BigQuery table,
    e.g. pandas_gbq.to_gbq(dataframe, "dataset.table")
    - read_gbq: runs queries against BigQuery tables,
    e.g. pandas_gbq.read_gbq("SELECT * FROM dataset.table")
  • Make sure that we have an entry point function, which takes request as an argument and returns a value. We are not really using this request argument, but GCF requires it for HTTP-triggered functions.

I am basing this step heavily on this tutorial.

With these changes implemented, we can create a GCF instance and run the code in a serverless fashion.
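As a rough sketch, the modified script could look something like the one below. The function, dataset, table, and project names are placeholders, and fetch_shortages stands in for the actual scraping logic from part I:

    import pandas as pd
    import pandas_gbq

    def fetch_shortages():
        # Placeholder for the RSS-feed scraping and parsing logic from part I,
        # which should return the parsed records as a DataFrame.
        return pd.DataFrame({"medication": ["example"], "status": ["shortage"]})

    def main(request):
        # Entry point for GCF: the HTTP request object is passed in as the
        # first argument, even though we don't use it here.
        df = fetch_shortages()
        # Append the new rows to the BigQuery table.
        pandas_gbq.to_gbq(
            df,
            "shortage_dataset.shortage_table",
            project_id="my-gcp-project",
            if_exists="append",
        )
        return "OK"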

Step 2 — Run the Python code through Google Cloud Functions (GCF)

So far, we have been running our Python code snippet manually. We want to automate this part and run it in the cloud rather than on a local machine. There are a number of ways to do this on GCP. You might have seen this chart of the different compute alternatives in GCP:

Compute options in GCP

At the lowest level of the spectrum, we have Compute Engine (GCE), GCP's Infrastructure as a Service (IaaS) offering. If we go with this option, we can spin up a VM, install Python and the required libraries, install git and clone the above code from a repository, and finally schedule a cron job to run that code at specific times. This is a perfectly valid solution and can achieve what we are trying to do here. There are a couple of considerations, however, that might make it a less suitable option compared to the other side of the spectrum: managed services.
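For illustration, the cron entry on such a VM could look something like this (the times and script path are placeholders):

    # Run the scraper at 06:00 and 18:00 every day
    0 6,18 * * * /usr/bin/python3 /home/user/rss_feeder.py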

There are steps in the above solution that are handled by the user but could be managed automatically by GCP if we moved to a more managed offering like GAE (a Platform as a Service solution) or GCF (a Function as a Service solution):

  • We need to decide the size of the VM in terms of CPU, memory, disk size, etc. If our needs change later, we have to manually change the VM or provision a new one. We have to keep monitoring the VM to manage this
  • We need to install all the required software and libraries and manage their versions ourselves
  • The VM has to be constantly running (constantly being billed for usage), even if the cron job is not active for some time

While IaaS solutions give more flexibility, an application like ours does not need it, so we can opt for a managed service (PaaS or FaaS) that takes care of some or all of the above points. In "serverless" services, server provisioning and configuration are hidden from the user and handled by the cloud provider. This helps save cost and reduce DevOps effort.

Note that GKE and GAE are also viable options, each taking care of some of the DevOps aspects that are manual in a GCE solution. Ultimately, we could build the same solution on all four options, with different pricing, management responsibilities, and levels of flexibility.

With that background, let's start building a GCF. GCF manages all of the above tasks and only bills the account when the function is run (invoked). The function can be triggered by an HTTP request, a change in Google Cloud Storage, a Pub/Sub topic, or an array of Firebase triggers. It is a perfect choice for event-based or scheduled applications and microservice architectures.

If this is the first time you are using GCF, there are some setup steps you need to complete. First, enable the GCF API: make sure that billing is enabled for the project, then go to APIs and enable Cloud Functions.

Enable GCF API

Then create a new GCF instance. As mentioned, there are different trigger types for GCF. Choose HTTP for now and allow unauthenticated invocations.

Create new GCF

Choose Python for the runtime environment and paste your code. Also add the library requirements to the requirements.txt file.

Python runtime for source code
Requirements file outlines the libraries our code uses
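The requirements file simply lists the packages the code imports; for the libraries mentioned in step 1 it would look roughly like this (pin versions as needed for your code):

    pandas
    pandas-gbq
    gcsfs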

Create a new permission by allowing all users to invoke this GCF (if it doesn't already exist under Permissions): choose allUsers and give it the "Cloud Functions Invoker" role.

Create new role
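If you prefer the command line, the same permission can be granted with gcloud; the function name and region below are placeholders:

    gcloud functions add-iam-policy-binding rss-feeder \
      --region=us-central1 \
      --member="allUsers" \
      --role="roles/cloudfunctions.invoker"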

Try triggering the function by opening the trigger URL in a browser. You can also test it from the GCF menu. A couple of troubleshooting tips to keep in mind: the "Details" page and "Logs" are your friends when debugging your code, and it pays to test as much of the code as possible locally before pushing it to GCF.
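A quick way to test from the command line is to hit the trigger URL with curl (the URL below is a placeholder; copy yours from the function's Trigger tab):

    curl "https://us-central1-my-gcp-project.cloudfunctions.net/rss-feeder"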

Step 3 — Schedule GCF to run twice a day with Cloud Scheduler

Now that our scraping code snippet can be called and run with GCF, let's add a small component to make it run twice a day automatically with Cloud Scheduler. Cloud Scheduler is like an alarm clock that wakes our function up at regular times and runs it; it is the equivalent of a cron job in an operating system.

Create a new scheduler
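For reference, an equivalent job can be created from the command line. The sketch below assumes a twice-daily schedule, a placeholder job name, and the placeholder trigger URL from the previous step:

    gcloud scheduler jobs create http rss-feeder-job \
      --schedule="0 6,18 * * *" \
      --uri="https://us-central1-my-gcp-project.cloudfunctions.net/rss-feeder" \
      --http-method=GET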

This is the last bit of automation needed to run everything without manual intervention and shift the whole product to the cloud. The final step below adds a visualisation component to improve the report.

Step 4 — Dashboard improvement: add a timeline view

For this version, I want to make one last update to the appearance of our BI dashboard to include a timeline. The line chart we currently have doesn't quite convey a timeline. In future versions, we will keep improving the visualisation.

Data Studio offers a range of community-built visualisations that complement the out-of-the-box components. Here we will add a Gantt chart to display medication shortage timelines.

Let’s create a second page in our Data Studio dashboard. Go to your saved report on https://datastudio.google.com/ and click on the page number, then add a new page. You can name each report page separately.

Add new page

Then add the Gantt chart from the community visualisations.

The first time you add a community chart, you have to give permission for these third-party charts to use your data. If you are facing issues here, check the data source and make sure Community access is On and that you are the owner of the data credentials.

Checking community access

Configure the parameters to see the below timeline view.

Build a Gantt chart from our dataset

You will get the following diagram:

Gantt chart as our timeline

Now, while this is not very pretty, it is a minor improvement over the line chart we had on the first page. We need to clean up the medication names and categorise them to make this diagram readable. This, along with architecture improvements, will be the focus of part III.

That's it for V2 of our product. Again, we have ended up with a shippable product that satisfies the core functionality, with improvements in several areas. In part III, we will make some architecture improvements and add more data clean-up and processing. Stay tuned!

Andi Partovi

Andi is an entrepreneur and data scientist. He cofounded KeyLead Health, a cloud-based medical data analytics company, in 2019.