Google Data Catalog & Sisense metadata integration

Leverage Data Catalog to discover, track lineage, and annotate Sisense assets

Ricardo Mendes
Google Cloud - Community
11 min readJan 11, 2022

--

Background photo by Lauren Mancke on Unsplash

In the below paragraphs, I share thoughts and knowledge that drove the development of one more sample connector for Google Data Catalog: the Sisense connector.

The highlight, in this case, is a column-level data lineage mechanism that allows Sisense users to easily track dependencies between ElastiCube table fields and the Widgets or Dashboards in which they are used.

It is the fourth open-source connector that allows customers to ingest metadata from BI Systems into Data Catalog: the ones for Looker, Tableau, and Qlik Sense have been available for a while.

Disclaimer: Google and/or Google Cloud do not officially support any tool to connect Data Catalog to non-GCP systems at the time this blog post has been written down (January 2022). What you will find here is merely the result of my experience as a Data Catalog seasoned user.

GENERAL OVERVIEW

The connector consists of a Python package bundled with a command-line interface. Source code is available on GitHub. Instructions on installing and running it are available there as well.

Long story short, it:

  1. Scrapes metadata from a given Sisense server by consuming its REST API.
  2. Prepares the information to fit the catalog entity model.
  3. Ingests the prepared metadata into Data Catalog through its API.
Architecture overview: Google Data Catalog and Sisense metadata integration
Image 1. Architecture overview: Google Data Catalog and Sisense metadata integration (image courtesy of the author)

The minimum requirements to run the connector are listed below:

  1. Sisense: a login/password pair with reading access to the Sisense REST API on a given server.
  2. Data Catalog: a Service Account allowed to create Entry Groups, Entries, Tag Templates, and Tags (please refer to the documentation in case you’re not familiar with Data Catalog roles).
  3. connector: a Python 3.6+ or Docker environment.

SISENSE ASSET TYPES AND RELATIONSHIPS

Let me briefly describe some assets available on Sisense. Properly understanding their meaning and how they relate to each other will help you keep up with the coming sections.

Sisense asset types and relationships
Image 2. Sisense asset types and relationships (image courtesy of the author)
  • Folders are grouping structures that contain Dashboards and subfolders.
  • Dashboards are visual representations of data to tell a story. They present complex data in easy-to-understand ways so that users can discover patterns, trends, and anomalies that are occurring in the business.
  • Widgets, the main components of a Dashboard, are single units that represent one data category. They present data in the format of charts, tables, maps, text, etc.
  • ElastiCubes are Sisense’s unique analytics databases. They boost native support for a wide range of data sources, providing BI Analysts a simplified query interface no matter where data comes from.

Fields, Filters, and Formulas are the bonds between ElastiCube table fields, Widgets, and Dashboards:

  • Fields bind Widgets to their underlying data sources (ElastiCube columns).
  • Filters allow users to select data at a Dashboard or Widget level.
  • Formulas allow field calculations to be performed on the fly.

Fields, Filters, and Formulas are not handled as first-class assets in the scope of this work. However, their metadata is a foundational building block for the lineage mechanism that tracks dependencies between ElastiCube table fields and their related Dashboards and Widgets.

DESIGN WORK

When it comes to systems integration, one of the initial steps consists of design work. We need to understand the concepts and entities from each system and how they fit the other end. The Sisense asset types and their relationships were described in the previous section. I assume the reader has an elementary understanding of Google Data Catalog, but in case you don’t please have a look at this blog post.

The connector handles three first-class Sisense asset types in its initial release:

  1. Folders
  2. Dashboards
  3. Widgets

We see ElastiCubes as first-class assets as mentioned in the previous section. Still, due to time constraints, their metadata is not fully handled by the connector in the current version (0.1.0). ElastiCubes are searchable through Google Data Catalog by some means, but there’s no information about their upstream connections, for instance. Downstream dependencies, however, are mapped in the format of Dashboards and Widgets metadata.

Regardless of the BI System asset type, Data Catalog counts on the three classes highlighted below to organize and store its metadata:

Entry, Tag Template, and Tag: the main Google Data Catalog classes
Image 3. Entry, Tag Template, and Tag: the main Google Data Catalog classes (image courtesy of the author)

That being said, there will be a single catalog Entry for each first-class asset. It will capture the most relevant metadata that describes the asset, such as a Display Name, Description, dates when it was created and last modified, the System used to manage it (entry.user_specified_system = 'sisense'), and its Type: entry.user_specified_type = 'dashboard' or entry.user_specified_type = 'widget', for instance.

Entries have standard fields that cannot capture all metadata from the assorted Sisense assets. Here Tags come to help us: they are flexible structures we can use to annotate catalog entries, thus extending the standard Entry fieldset. As an example, Dashboards have a string field indicating the ElastiCube they are connected to, which is not covered by any Entry field. We can then attach a Tag with a datasource field to a Dashboard-related Entry in order to persist such metadata; a simple approach that avoids losing relevant information when integrating Sisense and Data Catalog. The same reasoning applies to a variety of fields readable through Sisense’s REST API.

Tags are created from templates, which can be thought of as their defining elements. Tag Templates are used to specify the names and types of the Tag fields that will actually hold metadata. The below code snippet (simplified to improve readability) brings a Tag Template intended to represent Dashboards’ metadata:

Code snippet 1. Make a Tag Template in Google Data Catalog to annotate Sisense Dashboard-related Entries

The code that creates the Tag Templates used by the connector can be found here. You will notice there is a template for each first-class asset type.

You will also notice a Tag Template for JAQL metadata. JAQL queries are commonly used by the widgets to execute queries against the Elasticubes, so we decided on them as the initial support for the data lineage mechanism — it’s noteworthy we believe there is room for parsing SQL queries as a future improvement. Tags from the JAQL Tag Template are applied to Fields, Filters, and Formulas, which are mapped as schema columns in the Dashboard and Widget-related catalog Entries.

Please refer to the connector’s documentation for a list of all Tag Templates and their fields.

Lastly, there are relationships between the Sisense assets — most notably parent-child. Although Data Catalog so far does not support relationships between Entities, the connector addresses some of them through Tags. Three Tag fields usually link assets to their parents (aka upstream): the parent’s Id, Name, and a URL that allows the user to navigate to the related Entry using Data Catalog’s UI. Taking the above Tag Template as an example: a Folder, which is a Dashboard’s parent, is identified by the folder_id, folder_name, and folder_entry fields. If you look at the source code or run the connector, you will notice the same pattern is used in other templates to mimic resembling parent-child relationships.

THE SAMPLE CONNECTOR

Now it is time to look at some technical details. Given this is a sample connector, the source code and reference documentation are intended to be clear enough so anyone can understand them. But, as we are talking about software, some matters are still worth clarifying.

Solution architecture

Four main components are used to settle the solution, as shown in the below diagram. The yellow box represents the Sisense REST API; gray boxes represent connector-specific resources; the blue box is the Data Catalog API.

Components of the Sisense connector for Google Data Catalog
Image 4. Components of the Sisense connector for Google Data Catalog (image courtesy of the author)
  1. Sisense REST API: a remote component that allows the connector to read metadata from all Sisense assets.
  2. google-datacatalog-sisense-connector: works as the orchestrator of the scrape prepare ingest workflow mentioned at the beginning of the present blog post. It contains the code that allows users to start the connector through the command-line interface. But, much more than this, code that handles authentication, metadata reading (scrape stage), and Sisense to Data Catalog entity translating (prepare stage). Metadata ingestion (the ingest stage) is delegated to the google-datacatalog-connectors-commons component.
  3. google-datacatalog-connectors-commons: code shared by all Data Catalog open-source connectors, comprising metadata ingestion support and execution monitoring tools.
  4. Data Catalog API: a remote component that allows the connector to manage Entries for user-specified resource types, Tag Templates, and Tags.

CLI operations

In addition to a synchronize operation provided in the other Data Catalog connectors, the Sisense connector offers two extra command-line options. In a nutshell:

  1. sync-catalog synchronizes Google Data Catalog with a given Sisense server.
  2. find-elasticube-deps finds Widgets’ fields or filters or Dashboards’ filters that depend on a given ElastiCube table or field. Then, prints the results in the console.
  3. list-elasticube-deps searches Data Catalog to understand the Widgets’ or Dashboards’ data definition based on JAQL queries and table fields. It also prints the results in the console.

The first operation represents a minimum requirement for such a connector — synchronizing metadata between the two systems — and is exactly the same feature provided by all preexisting connectors.

The second and third operations are unique features of the Sisense connector, made possible due to its column-level lineage mechanism powered by Sisense-provided JAQL queries metadata at the Field, Filter, and Formula level.

You will find more details on how to use the CLI in the connector’s reference documentation.

Tradeoffs

The sync-catalog operation creates JAQL metadata Tags for most of the Dashboard and Widget properties that depend on JAQL queries, i.e., Fields, Filters, and Formulas, to enable column-level lineage tracking. Each of the aforementioned properties needs at least one of these Tags. There is an increased amount of API calls to create such Tags when compared to similar connectors. On the other hand, the Tags are quite small: ~4 fields each. Given API calls and metadata storage are key components of Data Catalog’s pricing model, we strongly recommend referring to the product documentation to evaluate costs depending on your environment characteristics and usage needs.

RESULTS

In this section, you will find a preview of the connection execution results. To get started, please have a look at the below screenshot. It brings part of a sample Ecommerce Dashboard (some Widgets were suppressed to keep the image succinct).

Sample Widgets from a Sisense Dashboard
Image 5. Sample Widgets from a Sisense Dashboard (image courtesy of the author)

Next, you can see the results of performing a catalog search after running the sync-catalog operation. I’ve used the system=sisense type=widget tag:dashboard_title:"Sample - Ecommerce" query to look for all Widgets that belong to the Sample - Ecommerce Dashboard. The Widgets presented in the above screenshot are highlighted in the results.

Search Sisense Widgets using Google Data Catalog
Image 6. Search Sisense Widgets using Google Data Catalog (image courtesy of the author)

Similar queries might be used to look for other asset types, e.g., system=sisense type=folder or system=sisense type=dashboard. Refer to the Data Catalog search syntax documentation to find more options.

Clicking one of the Entries will bring up its metadata. Please notice the Entry has a schema. Instead of database columns, it represents the REVENUE vs. UNITS SOLD Widget Fields. The same design is used to ingest Filters and Formulas metadata into Google Data Catalog.

Column-level data lineage implemented in the Sisense connector for Google Data Catalog
Image 7. Column-level lineage implemented in the Sisense connector for Google Data Catalog (image courtesy of the author)

I guess you’ve also noticed each field is annotated with a Tag created from the Sisense JAQL Metadata Template. Clicking Total Revenue’s Tag allows us to find where its data come from: Commerce table, Revenue field. This is an elementary feature when I say the connector offers data lineage support.

Additional data lineage features

Two more lineage-related features are available through the CLI. To demonstrate the first one, let’s say someone wants to know all the Fields, Filters, or Formulas that depend on the Commerce.Revenue table field. She can perform a catalog search through the UI, but the results will be presented at the Entry level, and she would need to manually inspect all the Entries to get the answer.

The find-elasticube-deps operation intends to simplify such a search, as shown in the below screenshot.

Sisense connector for Google Data Catalog: Sample results for the “find-elasticube-deps” operation
Image 8. Sample results for the “find-elasticube-deps” operation (image courtesy of the author)

Please notice:

  1. The --table Commerce and --column Revenue arguments were provided, and the operation accepts an optional --datasource argument. Either a data source, table, or column argument must be provided.
  2. Seven results were found, which means the table field is used by a bunch of Widgets or Dashboards.
  3. Only matching Fields or Filters are included in the resulting list, saving user time when compared to using the UI (results from 2 to 7 were suppressed to improve readability).

Now, let’s say you want to know what ElastiCube resources are used to populate the Dashboard illustrated in Image 5. Someone might need a handful of searches and clicks to get the job done through the Data Catalog UI. The second additional feature, list-elasticube-deps, brings an effortless option to get the answer, as shown below.

Sisense connector for Google Data Catalog: Sample results for the “list-elasticube-deps” operation
Image 9. Sample results for the “list-elasticube-deps” operation (image courtesy of the author)

Please notice:

  1. The --asset-url https://__REDACTED__.sisense.com/app/main#/dashboards/__REDACTED__ argument (URL copied from the browser), which is mandatory.
  2. Ten results were found, which means one Entry for the Dashboard and nine for its Widgets.
  3. All the Fields and Filters that depend on ElastiCube resources are listed in the results. Similar to the former feature, this one saves user’s time when compared to using the UI (results from 2 to 10 were suppressed to improve readability).

IN CLOSING

The ability to discover data and understand how it flows in today’s complex environments is a must-have for most Big Data and Analytics teams. Google Data Catalog provides powerful search capabilities and counts on a rich ecosystem that allows users to discover and annotate data assets managed both on Google Cloud and external systems (through 15+ open-source connectors).

Automatically tracking data lineage is still a challenge, though. Experience shows that the results depend, among other things, on the stack used by the team. For example, the column-level lineage implemented in the Sisense connector was only possible due to the way Sisense binds data to visual components and makes metadata available through its API.

A tool that integrates Sisense and Data Catalog then allows users of both systems to check, in a matter of seconds, how data source schema changes might impact production dashboards, just to name an immediate benefit. For teams that leverage a culture of experimentation and short business insights delivery cycles, having such information in near real-time can be a game-changer.

I hope you enjoy it!

References

--

--