Tableau Data Catalog: Harvesting Tableau fields

Various field types grow in our Tableau content

iAdvize Engineering


In our second article — that you can find here -, we detailed the first step in our quest of leveraging Tableau metadata at iAdvize: getting the connection between a published data source and an upstream table.

In this new article, we will dive into getting the required metadata to answer our second question: what is the relationship between a data source field and a complex calculated field?

In our organization, this topic interests two categories of users:

  • the Data team, who wants to track the impact of the deletion (or edition) of a field for the calculated fields using it;
  • the Creators, to explore how a given calculated field is built.

This part of our Data Catalog project is the most complex, so we’ll start with some detailed definitions.

Our Data Catalog harvesting Tableau fields, seen from the sky (Photo by Raland on Shutterstock)

Tableau fields are diverse

Please note again that the content related to an object in the Tableau Metadata API model will take an uppercase letter in this article. Some of them are defined in the previous article.

Also, for more clarity, we may use terms from our internal vocabulary at iAdvize that may differ from the terms used in the API documentation. We will have our internal vocabulary displayed in italic letters.

Understanding Tableau fields

A Tableau Field can be of three kinds:

  • a non-calculated data source field wraps into a PublishedDatasource. It directly connects to a column in an UpstreamTable linked to the PublishedDatasource, namely a UpstreamColumn. Example from the Sample-Superstore data source provided by Tableau Desktop: the « Customer Name » is a non-calculated data source field
  • a calculated data source field is computed with one or many non-calculated data source field(s) and/or other calculated data source field(s). It is useful for the data source’s creator, as it provides the users with a common calculated field directly hosted in the PublishedDatasource. The non-calculated fields used by calculated fields are named « upstream data source fields ». Example from the Sample — Superstore data source: « Profit Ratio » is a calculated data source field, computed with « Profit » and « Sales » (the upstream data source fields)
  • on the contrary, a workbook field doesn’t exist in a PublishedDatasource as it is created by a Creator user in a specific workbook. Thus, it is always a calculated field. Example from the Sample — Superstore data source: « Ship Month » is a calculated field we created in a workbook by extracting the month name from the « Ship Date » non-calculated data source field.

Following these definitions, we wanted to get the three fact tables — one per Field type — in our Data Catalog, plus link tables that would allow us to get the connections between a calculated field and the fields contained in its formula (calculated or not). However, it came out that the Tableau Metadata API doesn’t easily allow us to organize our metadata that way.

Indeed, Tableau Metadata API doesn’t explicitly distinguish between a workbook calculated field and a data source calculated field.

Our hack

The API makes two objects available to us:

  • the “fields” field from the PublishedDatasource object allows us to collect information from all data source fields (calculated and non-calculated). We can distinguish between both types thanks to the “upstreamFields” field (if empty, it surely is a non-calculated field). Note that the “datasourceField” object from the API doesn’t have the same definition as our data source field (more info here);
  • calculatedField provides details about all CalculatedFields (from a data source or a workbook). We didn’t find a way to separate the two kinds of CalculatedFields.

But don’t worry, we’ve found a way to work around it! It is summarized in this diagram:

In a nutshell: a workbook calculated field is a calculated field that is not a data source field. So we only need two tables to get the three types of field: dim_datasource_field and dim_calculated_field. An appropriate outer join will then do the job.

The diagram

This diagram shows only the tables which contain the content of each object. To get the relationships between these tables, one link table was added: link_upstream_datasource_field_calculated_field, which contains one row per connection between a data source field and a calculated field (from a data source or a workbook). Note that a link_upstream_column_datasource_field was not mandatory, as there can be only one upstream column at the source of a non-calculated data source field.

Playing with Tableau Metadata API

The content available in the API is fully handled by Tableau, thus its availability and the exposed fields may be impacted by the future software releases. The queries we present here are consistent with version 2021.3.

This GraphQL query allows us to collect the metadata from data source fields:

As mentioned before, we start with the PublishedDatasource object to gather the data source fields informations.

The queries we use to get the calculatedFields and upstreamColumns metadata are simpler as we directly use the calculatedFields and upstreamColumns objects:

Note that we also collect the Table id from the upstreamColumn to connect it with the upstreamTable metadata.

As for the link query, we have to use the upstreamFields field from the calculatedFields object:

These queries feed an ETL that flattens the data and stores it into a dedicated dataset in our data warehouse.

This diagram shows how the first two steps of our Data Catalog database come together:

Let the show begin!

Thanks to this new database, we can answer the following use cases:

  • a member of the Data team wants to check if the fields’ definitions are consistent among the data sources
  • a Creator user would like to know the formula of a data source calculated field without having to connect to the data source
  • a member of the Data team would like to remove the field « Client Country » from a data source and know which calculated fields would be impacted
  • a Creator user wants to know in which published data sources they can find a field called « Number of Conversations »

Here is a homemade search engine created with Tableau Desktop to illustrate the last example:

What’s next?

So now that we are able to link a data source to an upstream table and the field types, could we establish the links between the fields and the connected workbooks?