Visualizing your custom data in Google Data Studio

Thoughts on building a custom connector for Google Data Studio

Here’s what it can do and what it can’t do (easily)!

Matthias Heise
21 min read · Jun 2, 2020

I had a problem! I wanted to understand and visualize data and share the insights with others. And I was looking for a live data connection, meaning I didn’t want to download data from the source systems and import it into another system, but rather just open the report or click the reload button and see the most current data. Nowadays, most data sources offer a REST API to access data on demand. Why not use that?

While looking for a solution to my problem, I landed on Google Data Studio (GDS). Have you ever heard of GDS? Lucky you! Many people don’t even know that such a powerful tool exists and that it’s free. Some years ago I played around with GDS to analyze the Google Analytics data we had gathered from the usage of our Android app. That’s exactly where GDS originates: it was designed as a simple but handy tool to dig through your Google Analytics data.

Meanwhile, it has evolved. It can connect to a variety of traditional data sources, and it seamlessly integrates with all data storage services of the Google Cloud Platform (e.g. Datastore, BigQuery, and so on). Besides that impressive connectivity, its biggest advantage is its simplicity. There are reports and there are data sources used in reports. Both are intuitive to use. Both are powerful.

But nothing is as easy as it seems. This article explains some common use cases of data inside Google Data Studio and elaborates on how they impact your code and how data is retrieved from your custom API. It may help you decide whether to stick with your current solution or start implementing a custom connector for Google Data Studio.

Every issue discussed in this article is based on a problem that I ran into. This article is not a coding guide but a list of thoughts. If you are looking for a full introduction on how to build a custom connector from scratch, please refer to the Google documentation or this interesting post.

Why a custom connector for Google Data Studio?

The beauty of splitting report design from data retrieval.

But let’s step back. What was my problem? Right, I was looking for a solution to visualize my data. So I started talking to my peers, asking them “How do you work with that data?” or “What tool are you using?”. I learned about many different solutions. Since I’m quite familiar with working with data in code, my immediate choice was Python with NumPy and Pandas. It’s the most powerful setup for any kind of data wrangling. But you need to write the code yourself, and visualizing data with Matplotlib or other common libraries is not that handy if you are not using them daily. Others were okay with using Microsoft Excel on a snapshot of the data (a data export).

Custom tailor-made script for reporting

Finally, I came across another interesting idea. Someone had built a Google Apps Script with just a few code snippets and used that script inside a Google Spreadsheet. It’s somewhat similar to macros inside Microsoft Excel but more powerful. There were two things that I immediately liked. First, the code was quite short and not that complex. Second, since the data was imported into the spreadsheet, the analyst had the freedom to explore and test any kind of visualization without ever touching the code.

But there were three things I did not like. First, Google Apps Script is based on JavaScript, and I don’t know many people who are big fans of JavaScript. I was used to Microsoft .NET, Python and several other high-level programming languages, but JavaScript?! There is no type safety, and many other things are missing, too. Most prominently, there is nothing like the Pandas DataFrame. Our raw data is based on timestamps, and for analytical reasons it makes sense to pre-aggregate the data (e.g. grouping by weeks and joining it with other configuration data). No problem with Pandas, but a bit more complex in JavaScript, even with functions like .map() and .reduce(). The second thing I disliked was that the script is bound to the spreadsheet. It consumes some configuration data from one sheet, joins it with the raw data and outputs the final data onto another sheet. Someone could easily destroy that fragile system by just deleting some cells or adding a new column. Third, you lose control over how often the API is called. If the script includes functions that fetch data from the API, the analyst may use those functions in lots of cells in a spreadsheet, sparking lots of API requests as soon as the sheet is opened.

A dynamic report connecting to live data via a container-bound script

Nevertheless, my interest was sparked. The balance between freedom of design for the data analyst and the reduced effort of maintaining (or continuously developing) code for the software developer was just too catchy. After some further research, I learned that Google Apps Script comes in two flavours: container-bound, like the one I had seen, embedded into a spreadsheet or document, which makes it very fragile and hard to share with others; or standalone, a script that can be referenced by others or published as a web app without a user interface of its own.

Google uses those to extend spreadsheets and documents with add-ons. It is an elegant and very logical split between retrieving data (Google Apps Script) and working with that data (Google Spreadsheet), where the data retrieval is a lightweight script that is executed on demand in a hidden Google Cloud project at no cost, or in a separate project with accumulated costs if it is published and commercialized publicly.

That’s how I landed on GDS again. Any data source that you want to use in a report must be created (i.e. set up and configured) via a connector template. If you don’t find the perfect connector in the GDS gallery, you can implement a custom connector as a standalone Google Apps Script. While a container-bound script would get its input (e.g. configuration) data from a spreadsheet, the GDS connector is configured once during the setup of the data source itself, which makes it more robust since the configuration of the data source is separated from its actual usage inside the report.

A dynamic report connecting to live data via a custom data source

Working with data and reports

Data views (e.g. joined or blended data)

Quite often you will want to integrate more than just one data source into your report. Here, one of the biggest advantages of GDS comes into play. You can integrate tons of data sources via their different connector templates. Your custom connector still reads only your custom data, while all other data is read without changing a single line of your custom connector’s code.

If you need to join the different data sources into one, you will face one of the downsides of GDS. Whoever designs the report must manually define blended data sources that are joined on data fields that exist in all the data sources contributing to the blended data set. This is very much like writing SQL JOINs to query data from a traditional database, except that you cannot choose the type of join operation.

Creating a new view on blended data

For example, consider reading all tweets from Twitter between a start and end date, but limited to specific users that have been manually listed in a Google Spreadsheet. In such a case, the spreadsheet data is blended with the tweets, which amounts to a SQL LEFT JOIN. The following SQL pseudo-statement illustrates the intent, where twitter is the custom connector and the follower data comes from a private Google Spreadsheet.

SELECT <tweet, user, date> FROM twitter as A INNER JOIN followers as B ON A.user = B.user WHERE A.date BETWEEN <start> AND <end>

Or, using a WHERE clause with a subquery:

SELECT <tweet, user, date> FROM twitter WHERE date BETWEEN <start> AND <end> AND user IN (SELECT user FROM followers)

No low-level data joins

If you use one data source A to join with or filter another data source B, then A acts as an additional filter for B, limiting the amount of data transferred from data source B (e.g. dates between a custom start/end date and users within a dedicated list). In GDS, this is not available at the level of the data connector (low-level or API-level). Instead, both data sources read data with their own dedicated filters only (e.g. dates between a custom start/end date), and GDS blends both datasets later on (high-level or report-level).

Creating data views (blended data) in Google Data Studio

In such a scenario there is no way to easily reduce the number of data rows retrieved from data source B based on the input of data source A. This is especially a problem if your custom connector reads from an API with strict quota limits on data transfers and on daily or concurrent connections. In such a case, a Python/Pandas script is handier, since the filter data A is loaded into a data frame and then integrated into the query that reads the tweets (data B).

Loading, transforming and merging of data inside a Python script using API-level query filters

Calculated fields

Another advantage of a script is that you can cleanse, transform, manipulate or aggregate all data in code until you reach a format that is optimal for visualizing the data. GDS, on the other hand, is not a data transformation tool. If your raw data needs lots of transformation, you will run into a lot of pain. But if you only need simple operations based on one or multiple data fields, you will love calculated fields.

Creating a new calculated data field

Imagine the raw data includes a person’s full name, but you would like to include just the first name in the report to keep the names short. Simply add a new field based on a regular expression that extracts the first name from the full name. The beauty of this is that there is no need to change your custom connector’s code.

Usually, a calculated field is added to the chart in which the field is displayed (e.g. as a column in a plain data table). But GDS even allows you to add fields to the data source itself. This is especially interesting if the field is used a lot throughout one or multiple reports. Just think of the first-name extraction or date/time conversions.
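
As a small, hedged example: assuming the data source has a text field called Full Name, a calculated field formula like the one below returns everything before the first space. The field name and the exact pattern are illustrative and depend on your data.

REGEXP_EXTRACT(Full Name, '^([^ ]+)')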

There is no TODAY nor NOW

When using calculated fields, you will sooner or later need the current time, for example to calculate daily or monthly averages. For instance, you might want the number of days of the current month that have already passed. GDS will provide you with a date range if your custom data source requires it. But depending on the report configuration or the user’s selection of that date range, you only get a start and an end date, never the current date. The only workaround I have found so far is adding a fixed field containing the current timestamp to your custom connector’s code.
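
A minimal sketch of that workaround in Apps Script (the field name current_date is illustrative, and the snippet assumes the usual community connector structure):

var cc = DataStudioApp.createCommunityConnector();
var types = cc.FieldType;

function getFields() {
  var fields = cc.getFields();
  // ... the regular fields of the connector go here ...
  // Fixed field that always carries "now", since GDS has no TODAY()/NOW().
  fields.newDimension()
    .setId('current_date')
    .setName('Current Date')
    .setType(types.YEAR_MONTH_DAY);
  return fields;
}

// In getData(), fill that field for every row:
// row.values.push(Utilities.formatDate(new Date(), 'UTC', 'yyyyMMdd'));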

No report bands

A banded report with a custom header and footer

Maybe I’m mistaken, but to my knowledge there are no banded controls or report bands. For instance, you may want to design a default report template for reporting on project budgets. You would then configure a dedicated list of projects that you are interested in and feed those projects into GDS as a separate data source (e.g. a Google Spreadsheet) or a pre-configured filter. The desired outcome is a single report including details on each project of interest. But apart from data tables, charts, or pivot tables that may group data by project, there is no way to group several charts into a template that is repeated and filled with the corresponding project data. The only workaround I can think of is including a report parameter as a customizable query parameter in the URL when sharing the report with others. Then the report must be generated for every single project.

Filtering data

There is no key-value filter control

GDS was designed to connect to BigQuery and other NoSQL data sources that, in contrast to traditional databases, do not depend on IDs. If there is an entity with a name (for instance a follower), filtering is done on any text value.

SELECT * FROM tweets WHERE follower.name LIKE "John Doe"

In a relational data schema, the entity might be stored with an auto-incremented unique ID. If you want to filter other data related to that entity, you would rather use the user’s ID than their name.

SELECT * FROM tweets WHERE follower.id = 1

If the report should enable the user to filter for specific followers, the GDS filter control is the way to go. Usually, you would assign the follower names as human-readable display values and the followers’ IDs as keys. If the user selects a value, the underlying key (i.e. the ID) should be passed to the data source. Unfortunately, the GDS filter control does not support such key-value relations. This is especially painful if your custom data source supports data filtering at API level (e.g. pre-filtering via an HTTP GET query with query parameters). In such a case, you still need to load all data first and filter for the desired followers later on (post-filtering). The data being transferred might be quite large and the server load high.

As a workaround for this scenario, I define filterable key-value relations in the data schema of my custom connector. All data retrieved from the API is scanned for such related fields, and the key-values are cached as value-key pairs. Any new data request invoked by GDS is scanned for the same relations, and if the value is known, it is mapped to its key, enabling API-level pre-filtering.

[{
  id: 'user_id',
  name: 'User ID',
  filter: true,
  type: types.NUMBER
}, {
  id: 'user_name',
  key: 'user_id',
  name: 'User',
  type: types.TEXT
}]

The properties id, name and type are used by GDS. All others are used inside my custom connector only. The downside is that you don’t know whether your cached data is invalid until you check it against the API. Furthermore, you cannot use the keys until they have been retrieved from the API and cached by your connector at least once. So, any initial data request will query the full data. Any subsequent data request

{
  operator: 'EQUALS',
  fieldName: 'user_name', // not supported on API-level
  type: 'INCLUDE',
  values: ['John Doe']
}

can be mapped to

{
  operator: 'EQUALS',
  fieldName: 'user_id', // supported on API-level
  type: 'INCLUDE',
  values: [1]
}
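
A minimal sketch of that mapping step; the cache key layout and helper names are my own assumptions, not a fixed GDS API:

// Map human-readable filter values back to their cached keys so they can be
// used as API-level query parameters. "schema" is the field list shown above.
function mapValuesToKeys(dimensionsFilters, schema) {
  var cache = CacheService.getUserCache();
  (dimensionsFilters || []).forEach(function (orFilters) {
    orFilters.forEach(function (filter) {
      var field = schema.filter(function (f) { return f.id === filter.fieldName; })[0];
      if (!field || !field.key) return; // field has no key-value relation
      filter.fieldName = field.key;     // e.g. 'user_name' -> 'user_id'
      filter.values = filter.values.map(function (value) {
        var key = cache.get(field.key + ':' + value); // cached value -> key pair
        return key !== null ? key : value;            // fall back to the raw value
      });
    });
  });
  return dimensionsFilters;
}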

Handling and validating a chain of filters

I already introduced the concepts of pre- and post-filtering. Pre-filtering is done when querying the API, limiting the data being transferred and dramatically increasing the transfer speed and UI responsiveness. Post-filtering, in contrast, means applying filters once the data has been retrieved from the API. This requires implementing filter handling inside your connector; otherwise, you return unfiltered data and GDS applies any unhandled filters itself, with a negative impact on UI responsiveness.

GDS allows different combinations of OR filters that are combined with an AND filter. Every single filter can be configured as an INCLUDE or EXCLUDE condition on a logical operator. If your connector supports some filters at API level (pre-filtering; for instance user_id) and others not (post-filtering; for instance user_address), you need to handle the filters separately while still complying with the overall filter conditions. For better understanding, I added some comments to the example data request.

{
  dimensionsFilters: [ // BEGIN-AND
    [ // BEGIN-OR
      {
        operator: 'EQUALS',
        fieldName: 'user_id', // supported on API-level
        type: 'INCLUDE',
        values: [1234]
      },
      {
        operator: 'EQUALS',
        fieldName: 'user_address', // not supported on API-level
        type: 'INCLUDE',
        values: ['ABC']
      }
    ] // END-OR
  ], // END-AND
  configParams: {},
  scriptParams: {
    lastRefresh: 1588712095513
  },
  fields: [
    { name: 'user_id' },
    { name: 'user_name' },
    { name: 'user_address' }
  ]
}

Handling pre-filters and post-filters is painful enough already, since it is a bit more complex than the default filter design pattern (not to be confused with the intercepting filter pattern). On top of that, you will realize that GDS does not validate whether the specified filter combinations make sense in a human sense. For instance, a boolean field may be filtered by an INCLUDE condition on its true value combined, via an OR operator, with an EXCLUDE condition on the same true value. It's up to you to decide how to handle such cases in your custom code, or how to evaluate both filters into a single one that can then be used as a pre-filter at API level. You can still skip filtering altogether and leave it to GDS.

{
  dimensionsFilters: [ // BEGIN-AND
    [ // BEGIN-OR
      {
        operator: 'EQUALS',
        fieldName: 'user_is_male', // supported on API-level
        type: 'INCLUDE',
        values: [true]
      },
      {
        operator: 'EQUALS',
        fieldName: 'user_is_male', // supported on API-level
        type: 'EXCLUDE',
        values: [true]
      }
    ] // END-OR
  ], // END-AND
  configParams: {},
  scriptParams: {
    lastRefresh: 1588712095513
  },
  fields: [
    { name: 'user_id' },
    { name: 'user_name' },
    { name: 'user_is_male', filterOnly: true }
  ]
}

I decided to evaluate the OR operation as `1 OR 0 = 1`, where INCLUDE user_is_male == true is 1 and EXCLUDE user_is_male == true is 0, and to use the evaluation result at API level. My filter handling is based on the filter design pattern, with a tree of filters in which each node is an OR/AND operation and each leaf is a filter matching a single logical condition. Each filter holds a pre-filtering function that adds parameters to the API query if supported, and a match function that evaluates the logical filter operations on the received data row by row (i.e. post-filtering), short-circuiting the filter tree if one filter or its OR/AND operation fails. The following figure shows the data request and filter process in a simplified version, without data conversions and without local data caching.

Simplified data-request including pre-filtering and post-filtering (matching) but no data caching
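
In code, the leaf filters look roughly like the sketch below. It is deliberately simplified (EQUALS only, row objects keyed by field name, illustrative function names), not my exact implementation:

// Build a leaf filter with a pre-filtering and a matching (post-filtering) part.
function buildLeaf(filter, supportedFields) {
  return {
    // Add a query parameter if the API supports this field, e.g. ?user_id=1234
    preFilter: function (queryParams) {
      if (supportedFields.indexOf(filter.fieldName) !== -1) {
        queryParams[filter.fieldName] = filter.values.join(',');
      }
      return queryParams;
    },
    // Evaluate the filter against a single received row (post-filtering).
    match: function (row) {
      var included = filter.values.indexOf(row[filter.fieldName]) !== -1;
      return filter.type === 'INCLUDE' ? included : !included;
    }
  };
}

// dimensionsFilters is an AND of OR groups: every group needs at least one match.
function matchRow(row, orGroups) {
  return orGroups.every(function (group) {
    return group.some(function (leaf) { return leaf.match(row); });
  });
}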

Filtering on data fields only

Many hours passed by until I had finally mastered the filtering. Or so I thought. After some testing, I figured out that my custom API supports an HTTP GET query parameter that is not necessarily included in the fields (the data schema). Let’s imagine the following example. Your API manages customers, their postal addresses and other personal data. Furthermore, the API allows filtering by geographic region, for instance retrieving all customers from the Asia-Pacific region.

HTTP GET https://my.api/customers.json?region=apac

Even though a single data row on a customer includes all personal details, it may not include a field on the customer’s geographic origin. But GDS allows filtering only on fields that are included in the data schema. The only solution I have found so far is adding a custom parameter to the configuration of the connector. That parameter is set when the data source is set up. Usually, the configuration and its parameters cannot be changed at a later stage.

Luckily, custom parameters can be allowed to be overridden. The value of the parameter can then be configured in the report. Unfortunately, the same parameter cannot be used in a UI filter control and therefore cannot be presented to or selected by the user. But it can be included as a customizable query parameter in the URL when sharing the report with others.
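
A minimal sketch of such a configuration, assuming a hypothetical region parameter that the connector appends to the API query:

function getConfig() {
  var config = DataStudioApp.createCommunityConnector().getConfig();
  config.newTextInput()
    .setId('region')
    .setName('Region')
    .setHelpText('e.g. apac')
    .setAllowOverride(true); // value may be overridden per report
  return config.build();
}

// In getData(), request.configParams.region ends up in the query, e.g.
// https://my.api/customers.json?region=apac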

Caching Data

Google Apps Script Cache Service (low-level caching)

If your API comes with quota limits, you must consider reducing the number of API calls. Apart from that, you should also consider performance: retrieving locally cached data is way faster than calling a remote API. Luckily, Google offers the Apps Script Cache Service, a free text-based storage for key-value pairs.

But there are limits on keys and values. The maximum key length is 255 characters and the maximum value size is 100 KB. Besides that, the maximum retention time is 6 hours. If the initially retrieved data exceeds the 100 KB maximum but you still want to cache it, you will have to think about caching strategies and how to split the data.

Logically, each data row could be cached separately. In that case you need to manage the row keys so that you can retrieve the correct rows from the cache on a subsequent data request. Another solution is converting your data to text and splitting it into chunks. For each chunk, you add its index to the key. When retrieving the data from the cache, you iterate through the indexes until the cache no longer returns any data. There is a nice example that I highly recommend.
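
A sketch of the chunking approach; the chunk size and key layout are assumptions:

// Split the serialized payload into pieces below the 100 KB value limit and
// store each piece under an indexed key.
var CHUNK_SIZE = 90 * 1024; // stay safely below the 100 KB limit

function putChunked(cache, key, text, ttlSeconds) {
  for (var i = 0; i * CHUNK_SIZE < text.length; i++) {
    cache.put(key + '_' + i, text.substr(i * CHUNK_SIZE, CHUNK_SIZE), ttlSeconds);
  }
}

function getChunked(cache, key) {
  var chunks = [];
  for (var i = 0; ; i++) {
    var chunk = cache.get(key + '_' + i);
    if (chunk === null) break; // iterate until the cache returns nothing
    chunks.push(chunk);
  }
  return chunks.length ? chunks.join('') : null;
}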

Once you have dealt with the maximum value size, you may still run into the maximum key length. If your custom connector does not support API-level query parameters (e.g. it always retrieves your recently played songs on Spotify for your user), you don’t need any special key management. Since I implemented API-level filtering, I’m using the HTTP query URL as the cache key. If that key exceeds 255 characters, I generate hashes or GUIDs while avoiding duplicate keys. My cache implementation is based on this posting, but with extended key management.
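
A sketch of the key fallback; the length threshold and hash choice are illustrative:

// If the HTTP query URL is too long to be a cache key, use a digest of it instead.
function toCacheKey(url) {
  if (url.length <= 250) return url; // short enough to use directly
  var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, url);
  return digest.map(function (b) {
    return ((b & 0xff) + 0x100).toString(16).slice(1); // byte -> two hex chars
  }).join('');
}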

If your connector serves multiple tenants (data domains) or the data includes GDPR-relevant personal data, make sure to use the cache service with the user scope or to integrate the data domains into the cache keys wisely.

Google Data Studio data freshness (high-level caching)

No matter whether you are caching data in your custom connector, GDS will cache data as well. I couldn’t figure out when GDS will use its own cache and when it will request data from the connector. Furthermore, the retention times of GDS differ from those of the Apps Script Cache Service, and the person viewing the report cannot manually invalidate the GDS cache.

There is also no way to read the configured GDS retention time in your custom connector. In my implementation, I ignore the GDS caching and its retention time. Instead, I added a custom parameter to the configuration of the connector that allows defining a retention time, or disabling caching altogether by specifying a value of zero. Nevertheless, I would prefer merging both caching strategies into one.

Last refresh (cache invalidation)

Any data request contains a parameter with the timestamp of the most recent request. The value can be converted to a DateTime and used to decide whether to retrieve data from the local cache or from the remote API.

scriptParams: {
  lastRefresh: 1588712095513
},

There are two issues with that. First, you need to store a timestamp for each data request in the cache as well; otherwise, you cannot calculate the time span between the last and the current request. Second, you don’t need to work with timestamps at all if you are using the Apps Script Cache Service with a custom retention time: once the retention period has expired, local data retrieval will fail and data must be retrieved from the remote API anyway.

Development & Deployment

Versioning & Deployments

Any Google Apps Script can be edited, tested and deployed in the web browser. The downside is that you have only one version of the code. If you want to manage different versions, you need to set up a local development environment using the command-line tool clasp and optionally the Google Data Studio Custom Connector Generator (dscc-gen), and push your code to a Git repository or any other version control system.

Whenever you work on one version, you need to push the code of that version to the Apps Script project. That version will automatically be used by any HEAD deployment of your script. You had better make sure that nobody relies on the HEAD deployment for productive usage. For production, you can create new deployments: for the currently uploaded version of your code, you create a version in Apps Script and assign it to a dedicated deployment.

No switching between deployments

As mentioned, you can maintain different versions of your code and create deployments for them. But every data source that someone configured based on your custom connector is linked to a specific deployment. For example, any data source created for testing purposes will be linked to the HEAD deployment while you are still developing your custom connector. If you then use those data sources inside a report, they remain linked to the HEAD deployment.

Once you are done, you may create a PRD deployment to keep the current working code separate from the continuous development and improvement of your code. Ideally, all existing data sources and reports would then be linked to the PRD deployment as well. But there is no way to do so. Instead, you need to create new data sources based on the PRD deployment and use them inside existing reports, or create new reports.

You will even have trouble distinguishing between the deployments when creating a data source. All active deployments are listed as data sources in the Data Studio gallery. If you don’t change the name of the connector in the manifest, they will all look the same in the gallery. So I started using a prefix for all environments except PRD.

The code of your custom connector is visible as a project in the Apps Script interface as well as in Google Drive. That’s the starting point for investigating how your connector performs. Logs can be accessed, and executions and errors are listed. But they are arranged by code version, not by deployment name. If you have only two deployments but, say, 10 versions of code of which only 2 are deployed, you will still see 10 tabs.

Debugging and error messages

A custom connector is a cloud project. It is volatile and is only instantiated when GDS requests the data schema (fields), the type of authentication, or data. That’s why it is usually stateless, while its configuration is managed by GDS and data is cached via the Google Apps Script Cache Service. You can still test and debug your code step by step in the Google Apps Script interface, and all logs are written to Stackdriver Logging.

A very helpful error message *sarcasm off*

But sometimes you will face error messages that won’t give you any idea of what’s actually wrong. Or sometimes you don’t get any error at all, but still no data is displayed. Even though Google’s documentation on Apps Script and on building a custom data connector is quite comprehensive, you may get frustrated by such errors.

No custom user agent

The API that I connected to GDS via my custom connector requires HTTP GET requests with a custom user agent. That agent is used to distinguish between different clients and to find their corresponding logs on the API side. Google Apps Script uses the UrlFetchApp service for HTTP GET and POST requests. Unfortunately, that service will override your user agent (see the code below). We found no other workaround than sending the information in a new custom HTTP header.

// The User-Agent header set here will be overridden by UrlFetchApp.
var headers = {
  'User-Agent': 'my custom user agent'
};

var options = {
  headers: headers
};

var response = UrlFetchApp.fetch(url, options);
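
The workaround then looks roughly like this; the header name X-Client-Agent is illustrative and must be agreed with the API:

// Send the client identifier in a custom header that UrlFetchApp leaves untouched.
var options = {
  headers: {
    'X-Client-Agent': 'my custom user agent'
  }
};
var response = UrlFetchApp.fetch(url, options);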

Authentication included

Most APIs require some kind of authentication. Sometimes just a username and password, but more often a user token or API key. I’ve seen lots of code in which someone hardcoded an API key. This is bad practice, since the key should be kept secret. It is even more of a problem if the key has broader access rights on the API than the developer using it should have (e.g. reading GDPR-relevant data).

Apart from that, the various authentication flows are not that easy to understand, implement and test. Luckily, Google takes away that pain. It supports authentication via API keys, user tokens, username and password, as well as OAuth 2.0 out of the box. Simply add a few methods to your code and you are done. GDS will display the correct user dialogues and take care of the auth flow. You only need to validate and store the credentials.
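
A minimal sketch of key-based authentication; the property name and the validateKey() helper are placeholders for your own logic:

var cc = DataStudioApp.createCommunityConnector();

function getAuthType() {
  return cc.newAuthTypeResponse()
    .setAuthType(cc.AuthType.KEY)
    .build();
}

function setCredentials(request) {
  var key = request.key;
  // validateKey() stands in for a test call against your API.
  if (!validateKey(key)) {
    return { errorCode: 'INVALID_CREDENTIALS' };
  }
  PropertiesService.getUserProperties().setProperty('dscc.key', key);
  return { errorCode: 'NONE' };
}

function isAuthValid() {
  return PropertiesService.getUserProperties().getProperty('dscc.key') !== null;
}

function resetAuth() {
  PropertiesService.getUserProperties().deleteProperty('dscc.key');
}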

Sharing, Credentials & Publishing

There are different ways to share your custom connector: privately per person, on a corporate level, as an open-source community connector, or as a commercialized connector in the GDS gallery. All of these assume that you share your connector as a template so that others can create their own GDS data sources based on that template.

I figured out that there are other, easier ways. Publishing your code as open source requires others to create their own Apps Script projects and deployments into which your code is pushed. Sharing your project on a corporate level still requires others to create data sources on their own, and you never know if they unintentionally use the HEAD deployment instead of PRD. Instead, create the GDS data sources yourself using the dedicated deployment and then share those data sources with other analysts. Or even use your data sources in a report and share that report with others.

Sharing your connector so that others can access your custom data source

Everyone who uses your shared data source or report automatically has access to the data, since the connector uses your credentials (owner’s credentials). No further authentication is needed. This is especially nice if your API key or credentials have privileged access that is essential for the report, but you still don’t want to share your credentials with others by email. If you do want others to provide their own credentials, you can configure the data source to ask for the viewer’s credentials.

Summary

My personal opinion is that GDS reports and data sources are extremely powerful. A developer does not need to care about target environments, virtual machines or containers, nor about scaling up for performance reasons. And everything is free. Implementing a custom connector can be done in just a few hours if your custom API is simple, with just one data entity and not too many HTTP query parameters that you want to handle in your connector.

But whenever the API is quite complex, you may want to implement a separate connector for each data entity. And whenever the performance of the API and its quota limits highly depend on how data is retrieved, you are better off with a custom script, since it gives you full control, whereas you never know how an analyst will use your connector in a report.
