Tableau Data Catalog: Discovering Tableau’s metadata…
… and noticing its complexity
In the first article of this series, we explained how the need for a Tableau Data Catalog emerged at iAdvize and why we decided to build it ourselves. This second article focuses on how we structured the queries to extract all the required metadata.
The main method
The data flow to expose data from Tableau Metadata API to our end-users is split into three key steps:
- A Python worker queries the API (GraphQL), flattens the data, and loads it into a dedicated dataset in our data warehouse (Google BigQuery)
- A certified Tableau data source is created on top of these tables
- A workbook connects to this datasource and displays the metadata to the Tableau user.
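As a sketch of the first step, here is roughly what the worker's flattening logic can look like. The payload shape mirrors the Metadata API's nested responses, but the function and field names are illustrative, not our production code; the BigQuery load (e.g. via the google-cloud-bigquery client) is left out.

```python
# Illustrative sketch of the worker's flattening step.
# The payload shape mirrors a Tableau Metadata API GraphQL response
# (publishedDatasources with nested upstreamTables); names are hypothetical.

def flatten_datasources(payload: dict) -> tuple:
    """Split a nested GraphQL response into flat fact rows and link rows."""
    fact_rows, link_rows = [], []
    for ds in payload["data"]["publishedDatasources"]:
        fact_rows.append({
            "datasource_id": ds["id"],
            "name": ds["name"],
            "has_extracts": ds.get("hasExtracts"),
        })
        # One link row per (published data source, upstream table) pair
        for table in ds.get("upstreamTables", []):
            link_rows.append({
                "datasource_id": ds["id"],
                "table_id": table["id"],
            })
    return fact_rows, link_rows

sample = {
    "data": {
        "publishedDatasources": [
            {"id": "ds1", "name": "Sales", "hasExtracts": True,
             "upstreamTables": [{"id": "t1"}, {"id": "t2"}]},
        ]
    }
}

facts, links = flatten_datasources(sample)
print(len(facts), len(links))  # 1 fact row, 2 link rows
```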
However, the API queries turned out to be more complex than we initially expected.
Our first try: underestimating the complexity of Tableau Metadata
With our methodology set, we started working on the first step: building the correct query to gather all the metadata needed. At first, we indeed thought one query would be enough to extract it all. So, based on the API's documentation, we listed the Tableau objects we wanted in our Data Catalog:
- the published datasources,
- their source tables,
- the tables’ columns,
- their lineage with the published datasources’ fields,
- the calculated fields,
- the workbooks using them,
- the owners of the published datasources and workbooks.
The more we thought about our wishlist, the more complex it seemed to gather it all and keep track of the objects’ relationships within a single query. As an example, this is what our first query looked like:
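In spirit, that first attempt was a single, deeply nested query along these lines. The field names follow the Metadata API schema as documented for version 2021.3, but treat this as a sketch rather than our exact query:

```graphql
{
  publishedDatasources {
    id
    name
    owner { id name }
    fields { id name description }
    upstreamTables {
      id
      name
      columns { id name }
    }
    downstreamWorkbooks {
      id
      name
      owner { id name }
    }
  }
}
```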
This first try raised two problems:
- First, the pagination issue. The result is monumental and exceeds the API’s output limit, so it required a pagination method. But it turned out that the pagination method itself limits the result’s size, and the process of filtering the query so that it fits within the API’s limits rapidly became a headache.
- Second, this query doesn’t gather all the metadata that we need. Among other things, it covers neither the workbook-related objects nor the complex relationship between a field and a calculated field.
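For reference, the Metadata API exposes Connection variants of its objects (e.g. publishedDatasourcesConnection) that accept first/after arguments for cursor-based pagination, so the worker ends up running a loop like the one below. The fetch function is stubbed here for illustration; in practice it would POST the query to the GraphQL endpoint with the cursor as a variable.

```python
# Sketch of cursor-based pagination against the Metadata API's
# publishedDatasourcesConnection. `fetch_page` is a stub standing in
# for an authenticated POST to the GraphQL endpoint.
from typing import Optional

QUERY = """
query ($cursor: String) {
  publishedDatasourcesConnection(first: 100, after: $cursor) {
    nodes { id name }
    pageInfo { hasNextPage endCursor }
  }
}
"""

# Fake two-page response, in place of a real HTTP call
PAGES = {
    None: {"nodes": [{"id": "ds1"}],
           "pageInfo": {"hasNextPage": True, "endCursor": "c1"}},
    "c1": {"nodes": [{"id": "ds2"}],
           "pageInfo": {"hasNextPage": False, "endCursor": "c2"}},
}

def fetch_page(query: str, cursor: Optional[str]) -> dict:
    return PAGES[cursor]

def fetch_all(query: str) -> list:
    """Follow endCursor until hasNextPage is False, collecting all nodes."""
    nodes, cursor = [], None
    while True:
        page = fetch_page(query, cursor)
        nodes.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return nodes
        cursor = page["pageInfo"]["endCursor"]

print(fetch_all(QUERY))  # [{'id': 'ds1'}, {'id': 'ds2'}]
```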
This first try taught us that the Tableau metadata schema is quite complex and can’t be approached with a single query. It turned out that to properly extract this data, we had to understand the schema precisely and work out how to split it, creating as many queries as necessary to reflect that split.
In other words: from “one to rule them all”, we switched to “divide and rule”!
From one query to multiple tables, step 1: from an upstream table to a published datasource
Note: To improve the readability of this article, from now on the elements of Tableau content that relate to an object in the Tableau Metadata API model will take an uppercase letter.
We believe that all projects start with a question, and it turned out that the first question we’d like our Data Catalog to answer is: how is a published data source related to an upstream table?
You can check this documentation to get an exhaustive explanation of the elements of Tableau’s content, but in this article, we’d like to define some of them that will be useful.
A Tableau data source can be one of two kinds: published or embedded. A Published Data Source is hosted on a Tableau Server; this is where connections with workbooks are established. An Embedded Data Source, on the other hand, is wrapped into a workbook and can only be used in that specific workbook. At iAdvize, we mainly use Published Data Sources for certified content, so we decided to focus on them for our Data Catalog.
A Published Data Source connects to the ‘origin’ of the data, namely a (or several) Table(s). A Table can be of any type Tableau can swallow: a table from our data warehouse or any database, a Google Sheet, a .csv file, a Salesforce object…
One last thing about Published Data Sources: they can connect to their source Table(s) via a live connection or an extract. With a live connection, the data isn’t stored on Tableau Server; instead, the connection instructions are embedded in the Published Data Source. With the extract method, the Published Data Source contains the data itself.
Now, let’s go deeper into the Metadata API vocabulary: a Table at the origin of a Published Data Source is called its Upstream Table, like a river of data flowing from uphill. This is not an exclusive relationship: a Table can feed multiple Published Data Sources, and vice versa.
Finally, I’d like to introduce two last key concepts that are useful when referencing a Published Data Source in a Data Catalog: the User who owns one or many Published Data Source(s), and the Tags associated with this content.
That’s it! We set the stage for the presentation of the first tables of our database:
This diagram shows only the tables which contain the content of each object. To get their relationships, we added two tables:
- link_upstream_table_published_datasource, which contains one row per connection between a Published Data Source and a Table;
- link_tag_published_datasource, which contains one row per tag attributed to a Published Data Source.
A link_user_published_datasource table was not necessary, as a Published Data Source has exactly one owning User.
Playing with Tableau Metadata API
Let me start with a disclaimer: the API content is handled entirely by Tableau. Thus, its availability could be impacted by future software releases. The queries we present here are consistent with version 2021.3.
Here are the GraphQL queries we use to get the fact tables:
They are quite simple as they collect fields (id, description…) from basic objects (publishedDatasources, tags).
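As an illustration, a fact query in that spirit can look like the following. The field names are consistent with the 2021.3 schema documentation, but verify them against your own server version:

```graphql
{
  publishedDatasources {
    id
    name
    description
    hasExtracts
    isCertified
    projectName
    owner { id name }
  }
}
```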
We would rather linger on the link queries:
To get the Upstream Tables of each of our Published Data Sources, we started from the publishedDatasources object, from which we picked up the id field of each connected upstreamTable. Note that we could have done it the other way round: starting with the databaseTables object and exploring the downstreamDatasources. In either case, we can flatten the result and get a link table with one row per connection between a publishedDatasource and an upstreamTable.
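A minimal version of that link query, kept deliberately small since only the identifiers are needed to build the link table (again, a sketch against the 2021.3 schema):

```graphql
{
  publishedDatasources {
    id
    upstreamTables { id }
  }
}
```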
The relationship between a Tag and a publishedDatasource is simpler, but here too we could have gone the other way round.
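The corresponding tag link query can be as small as this sketch:

```graphql
{
  publishedDatasources {
    id
    tags { id name }
  }
}
```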
These queries feed an ETL that flattens the data and stores it into a dedicated dataset in our data warehouse.
Let the show begin!
With this metadata, we can now answer questions such as:
- Which published data sources have extracts?
- Which published data sources are in this Tableau Server project?
- How many published data sources are certified?
- How many published data sources use the tag “GoogleSheet”?
- What is the main connection type between our published data sources and the upstream tables?
- If there is an issue with an upstream table in our data warehouse, how many certified published data sources are impacted?
The list is long, but you get the idea. As an illustration, here is how our brand new tables answered the last question:
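The logic behind that answer reduces to a join between the published data source fact table and the upstream-table link table. A minimal sketch with hypothetical in-memory rows standing in for our BigQuery tables:

```python
# Hypothetical rows mirroring our warehouse tables: the published data
# source fact table and the link_upstream_table_published_datasource table.
datasources = [
    {"id": "ds1", "name": "Sales", "is_certified": True},
    {"id": "ds2", "name": "Scratch", "is_certified": False},
    {"id": "ds3", "name": "Finance", "is_certified": True},
]
links = [  # one row per (upstream table, published data source) connection
    {"table_id": "t_orders", "datasource_id": "ds1"},
    {"table_id": "t_orders", "datasource_id": "ds2"},
    {"table_id": "t_costs", "datasource_id": "ds3"},
]

def impacted_certified(table_id: str) -> list:
    """Certified published data sources fed by the given upstream table."""
    hit = {l["datasource_id"] for l in links if l["table_id"] == table_id}
    return [d["name"] for d in datasources
            if d["id"] in hit and d["is_certified"]]

print(impacted_certified("t_orders"))  # ['Sales']
```

In the warehouse, this is a single SQL join filtered on the certified flag; the sketch above only illustrates the shape of the lookup.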
In our upcoming article, you will find more information about how we gathered and organized metadata to explore the relationship between a data source field and a complex calculated field.