Visualization on Steroid — using headless browser to auto-refresh Google Data Studio dashboards

By Isaac Yiu, Data Engineer

HK01 Product & Technology team
HK01 Tech Blog
6 min readJan 29, 2019

--

Makes fancy and informative dashboards with Google Data Studio!

Google Data Studio is a great data visualization tool for making interactive and beautiful dashboards. We, the data team at HK01, had been using Data Studio to visualize data-sets, user engagement statistics and insight discovery ever since the platform is in beta. While using open-sourced tools like Metabase or pricey Tableau might offer you greater flexibility for data discovery, nevertheless Data Studio remains our top pick for data visualization for the following reasons:

  • Easy integration with various data sources — including Google Analytics, Google Ads, Google BigQuery and common SQL databases (PostgresSQL, MySQL, Redshift Spectrum — our primary query engine); If that’s still insufficient, you can also write your own connector to access wherever your data resides;
  • No-frills, easy to use visualization — it used to be that Google Data Studio hasn’t got a diverse set of chart types compared to other proprietary platforms. Yet, with the recent update, you can even now write your own chart visualizations with your choice of JavaScript chart libraries! This opens up all sorts of new possibilities for data visualization (once this features is out of closed beta);
Community Visualization (currently in closed beta) — now you can write your own visualization with D3.js, Chart.js or whatever your favorite data visualization library is.
  • Hosted and Free — Google Data Studio might not be as feature-rich or flexible compared to open-sourced or proprietary alternatives, but it does have a very huge advantage — you don’t have to host it yourself yet it’s entirely free! On top of that, as our company makes use of Google G Suite for internal communications, it is a big plus that sharing and access control works pretty much the same way as any other Google Docs files.

While the tool itself solves a lot of problems for us, it’s not without demerits, one of the more notable limitation is the lack of an auto-refresh mechanism. To avoid straining the databases/query engine every time a user has reached the dashboard (imagine a thousand concurrent user viewing your dashboard), Data Studio employs the following two levels of caches:

  • Query Cache — every time a user makes a query, if the result for this exact same query was cached already (cache hit), the cached result is returned directly — it’s a passive form of cache;
  • Prefetch Cache — Data Studio predicts what queries could be requested by the user and pre-fetches as much query results as possible — it’s a proactive form of cache;

These caches by default, once refreshed, expire in 12 hours. For all the users viewing the dashboard, they will be seeing the cached data unless the editors of the dashboard explicit click “refresh data”. Due to the lack of an auto-refresh mechanism, report viewers (who by design have no permission to trigger a refresh) will be seeing outdated charts and metrics from probably half a day ago.

The “Cache Control” toggle actually only disables the prefetch cache, so no luck either by disabling cache.

There are certainly other folks that are facing this problem, for there are a numbers of Chrome extensions that automatically clicks the “refresh button” periodically. For a while, that is exactly what we were doing — we were keeping our MacBooks awake during midnight so that our local Chrome browser can click the refresh button once in awhile …

There has got to be a better way, right?

Enters our modern solutions.

So we ended up writing an AWS lambda that triggers a headless browser (a browser that works without GUI) to login to a Google account with editor permission, and then click the ‘refresh’ button. In the following section, I will go through the inner workings of the lambda.

The code used are available here: https://github.com/ymcatar/data-studio-refresher-lambda. You can simply follow the README to set it all up.

It’s not easy to setup a server-less environment in which a headless Chrome can be started successfully, due to all the dependencies that Chromium require. Fortunately for us, an open-source called Puppeteer Lambda Starter Kit has already done most of the heavy lifting for us, all we need to do is to write our main script to automate all the UI clicking for us:

The lambda is triggered asynchronously by Cloudwatch Events, which can be scheduled using cron or rate expressions. The lambda when invoked, will programmatically input the username and password, then navigate to the dashboard URL and then run a tiny Angular.js code snippet that invoke the ‘refresh’ method defined in the app header controller.

UI Automation at work.

Due to the serverless nature of AWS Lambda, the IP address of the machine where the script is run might be constantly changing. Therefore, occasionally Google will add an extra step in the login process, requiring you to type in your recovery email. Luckily for us, it is easy to detect its occurrence (by looking for an H1 element with the content of “Verify it’s you”). We can then add some UI automation logic to handle it accordingly.

Due to the nature of UI automation, it is not rare to see all sorts of exceptions taking place, from what we have observed after deploying to production for awhile, we can see errors from Chrome refusing to launch, pages not loading properly … so how can we handle the logic for retries?

Lambda by default will automatically retry, so no worries if it fails occasionally.

It turns out that we don’t even need to configure anything else, since AWS Lambda automatically re-run the lambda function in case of any unhandled exceptions. For asynchronously invoked lambdas (including Cloudwatch events), they will be retried with some slight delays for a maximum of two times.

Headless browser could be tasked to do far more than auto-refreshing a Google Data Studio dashboard. Off the top of my head, here are some potential applications:

  • Exports Google Data Studio reports as a PDF, and deliver as a daily email report to the whole company;
  • Extracts data on a regular basis from a webpage that lack a public API;
  • Takes screenshots of a website everyday for archiving purpose;
  • Scheduled UI test automation;

We opted to use the headless browser serverlessly, for it lifts you from the need to having a long-running execution environment, and doing chores like resource monitoring or rebooting the machine. Due to the nature of UI automation, it is not likely that the headless browser tasks will be running 24–7, so it would likely to be way more cost-effective to do it in a serverless fashion.

So what are the lesson learnt? That is, while a website may not provide you a programmer-friendly way to do a particular task, but with efforts, one can still attempt to automate a certain process without resorting to manual labor.

P.S. Google, if you are reading this, can you please add the auto-refresh feature — Uh?

Originally published at medium.com on January 29, 2019.

--

--