Google Data Catalog and Qlik Sense metadata integration

Leverage Data Catalog to discover & annotate Qlik Sense assets

Published in

Google Cloud - Community

10 min readJan 14, 2021

Background photo by Lauren Mancke on Unsplash

In this blog post, I’m going to share thoughts and knowledge that drove the development of a new sample connector for Google Data Catalog: the Qlik Sense connector.

This is the third open-source connector that allows customers to ingest metadata from BI Systems into Data Catalog: the ones for Looker and Tableau have been available for a while.

Disclaimer: Google and/or Google Cloud do not officially support any tool to connect Data Catalog to non-GCP systems at the time this article has been written down (January 2021). What you will find here is merely the result of my experience as a Data Catalog adopter.

GENERAL OVERVIEW

The connector consists of a Python package bundled with a command-line interface. Source code is available on GitHub. Instructions to install and run it are available there as well.

Long story short, it:

Scrapes metadata from Qlik Sense by consuming their REST and Websocket APIs.
Prepares the information to fit the catalog entity model.
Ingests the prepared metadata into Data Catalog using its API.

**Image 1**. Architecture overview: Google Data Catalog and Qlik Sense metadata integration

These are the minimum requirements to run the connector:

Qlik Sense: a login/password pair with reading access to the Qlik Sense Repository Service (QRS) and Qlik Engine JSON APIs on a given site.
Data Catalog: a Service Account allowed to create Entry Groups, Entries, Tag Templates, and Tags (please refer to the documentation in case you’re not familiar with Data Catalog roles).
connector: a Python 3.6+ or Docker environment.

QLIK ASSET TYPES

Let me briefly describe some assets usually available on a Qlik site. Properly understanding what they mean and how they relate to each other will help you keep up with the coming sections.

**Image 2**. Qlik Sense asset relationships

Streams are top-level elements; they are collections of Apps — also known as Documents.
Apps are containers for components used to load, process, and display data. Example components include Scripts executed to connect to data sources and extract data into the document, and Sheets.
Streams and Apps have standard properties such as Name, Id, Owner, and Description, but users can leverage the so-called Custom Property Definitions to extend the standard metadata set according to their needs. When managing Custom Property Definitions, users give them a name a list of acceptable values.
Custom Properties are the associations between Custom Property Definitions, Streams, or Apps. They materialize the user-extended metadata.
Sheets are containers for objects which are charts of various types.
The objects used to populate the Sheets can be built individually (hence used in a single Sheet) or may come from a library within the App. Those belonging to such a library are called Master Items and split into three types: Dimensions, Measures, and Visualizations.

DESIGN WORK

When it comes to systems integration, one of the initial steps consists of design work. We need to understand the concepts and entities from each system and how they fit the other end. The Qlik asset types were described in the previous section. I assume the reader has an elementary understanding of Google Data Catalog, but in case you don’t please take a look at this blog post.

The connector handles 7 Qlik asset types in its initial release:

Custom Property Definitions
Streams
Apps (only the published ones)
Master Items: Dimensions
Master Items: Measures
Master Items: Visualizations
Sheets (only the published ones)

Data Catalog, on the other hand, uses three main classes to get its job done:

**Image 3**. Entry, Tag Template, and Tag: the main Google Data Catalog classes

There will be a single catalog Entry for each Qlik asset. It will capture the most relevant metadata that describes the asset, such as a Display Name, Description, dates when it was created and last modified, the System used to manage it (entry.user_specified_system = 'qlik'), and its Type: entry.user_specified_type = 'sheet' or entry.user_specified_type = 'dimension', for instance.

Entries have standard fields that cannot capture all metadata from the assorted Qlik assets. Here Tags come to help us: they are flexible structures we can use to annotate catalog entries, thus extending the standard Entry fieldset. As an example, Apps have a boolean field indicating whether they were published or not, which is not covered by any Entry field. We can then attach a Tag with a published field to an App-related Entry in order to preserve such metadata; a simple approach that avoids losing relevant information when integrating Qlik and Data Catalog. The same reasoning applies to a variety of fields readable through the Qlik APIs.

Tags are created from templates, which can be thought of as their defining elements. Tag Templates are used to specify the names and types of the Tag fields that will actually hold metadata. The below code snippet (simplified to improve readability) brings a Tag Template intended to capture Apps’ metadata:

Code snippet 1. Make a Tag Template in Google Data Catalog to annotate Qlik App-related Entries

The code that creates Tag Templates currently used by the connector can be found here. You will notice there is a template for each Qlik asset type.

Additional templates are created for each Custom Property Definition’s Choice Value as long as they are assigned to at least one Stream or App:

The rationale behind this decision comprises allowing the connector to synchronize all metadata scraped from each Custom Property, at the same time it enables Qlik assets to be easily found by their custom properties — using query strings such as tag:property_name:"<PROPERTY-NAME>" and tag:value:"<SOME-VALUE>".
— connector docs

Lastly, there are relationships between the Qlik assets — most notably: parent-child, association, and composition. Although Data Catalog so far does not support relationships between Entities, the connector addresses some of them through Tags. In summary:

Three Tag fields usually link assets to their parents (aka upstream): the parent’s Id, Name, and a URL that allows the user to navigate to the related Entry using the Data Catalog UI. Taking the above Tag Template as an example: a Stream, which is an App’s parent, is identified by the stream_id, stream_name, and stream_entry fields. If you take a look at the source code or run the connector, you will notice the same pattern is used in other templates to mimic resembling parent-child relationships.
Custom Properties can be seen as associations between their definitions and the Streams or Apps they help to classify. Such associations are represented by specific Tags.
Compositions — e.g. which Dimensions, Measures, and Visualizations are used to build a given Sheet — are not addressed by the connector yet. This deserves a note, though: our experience with the Qlik Engine API let us believe it might be feasible adding such support in future versions.

THE SAMPLE CONNECTOR

Now it is time to look at some technical traits. The source code is intended to be clear enough so anyone can understand it. However, we are talking about software, and some matters are still worth clarifying. You will find some Qlik-specific quirks in the coming paragraphs; I’ll try to add as much context and meaningful links as possible, but one might still need to refer to the Qlik Sense for Developers Help for a deeper reading.

Components architecture

Six main components are used to settle the solution, as shown in the next diagram. Green boxes represent Qlik resources; gray boxes, connector specific resources; the blue box, a Data Catalog resource.

**Image 4**. Components of the Qlik Sense connector for Google Data Catalog

Qlik Sense Proxy Service (QPS) API: a REST API used for authentication purposes. It issues tickets that can be used as HTTP cookies or headers to authenticate subsequent requests to the QRS and Engine APIs.
Qlik Sense Repository Service (QRS) API: the connector reads Streams, Apps, Custom Property Definitions, and Custom Properties metadata from this REST API.
Qlik Engine JSON API: a WebSockets API that allows the connector to read metadata related to the Sheets and Master Items — Dimensions, Measures, and Visualizations.
google-datacatalog-qlik-connector: works as the orchestrator of the scrape → prepare → ingest workflow mentioned at the beginning of the present blog post. It contains the code that allows users to start the connector using the command line interface. But, much more than this, code that handles authentication, metadata reading (scrape stage), and Qlik to Data Catalog entity translating (prepare stage). Metadata ingestion (ingest stage) is delegated to the google-datacatalog-connectors-commons component.
google-datacatalog-connectors-commons: code shared by all Data Catalog open-source connectors, comprising full metadata ingestion support and execution monitoring tools.
Data Catalog API: allows users to create and manage Entries for custom data resource types, as well as Tag Templates and Tags.

User authentication

The connector currently uses NTLM to authenticate the Qlik user, which is the default authentication module for Qlik Sense Enterprise on Windows. NTLM also allows users to connect from outside a trusted zone. Thus, starting with NTLM was a design decision to reach a higher audience since the initial version.

For those looking for implementation details: the main authentication logic is handled by the Authenticator class, including the x-Qlik-Xrfkey header setup.

There are other options, though. After reading a lot of stuff in community forums we’ve noticed most programmatic interactions with Qlik Sense Enterprise on Windows seem to happen under certificate-based authentication instead of login/pass. JWT-based authentication seems to be achievable as well, but either option requires further investigation and testing.

REST vs WebSocket APIs

The components numbered 2 and 3 in the above list let it clear we use two APIs to read metadata throughout the scrape stage.

We tried to get as much metadata as possible from the Qlik Sense Repository Service (QRS) API. It implements a REST protocol, which is widely known by developers and easy to consume.

Please refer to the RepositoryServicesAPIHelper class to better understand what I mean by “easy to consume”.

The Qlik Engine JSON API, on the other hand, implements a WebSockets protocol and requires asynchronous programming. We took a while to get up to speed with it in the early days but then came to a solution that showed to be productive: BaseEngineAPIHelper is the base class of all Engine API interactions; its methods promote code reuse, leaving to the child classes the responsibility of handling only specific stuff, as you can see in the EngineAPISheetsHelper, EngineAPIDimensionsHelper, EngineAPIMeasuresHelper, and EngineAPIVisualizationsHelper classes.

RESULTS

In the next picture, you can see the results of performing a catalog search after running the connector. I’ve used the system=qlik type=stream query to look for all Streams available in a Qlik site in our testing environment.

Search all Qlik Streams using Google Data Catalog — **Image 5**. Search all Qlik *Streams* using Google Data Catalog

Similar queries might be used to look for other asset types:

system=qlik type=app
system=qlik type=custom_property_definition
system=qlik type=sheet
type=dimension
type=measure
type=visualization

Clicking one of the Entries will bring up its metadata:

Viewing Qlik Stream metadata in Google Data Catalog — **Image 6**. Viewing Qlik *Stream metadata in* Google Data Catalog (some fields were redacted to enforce privacy)

Please notice the example Stream is labeled with the GCPUser Custom Property in Qlik Sense; hence its catalog Entry is also annotated with a GCPUser-related Tag in Data Catalog (collapsed in Image 6). We can even look for all assets labeled with that same Custom Property, something users cannot achieve using only the Qlik Sense features. Let me change the query string to tag:property_name:GCPUser to show how it works:

**Image 7**. Search all Qlik assets labeled with a given Custom Property

I can search by Custom Property value, too — e.g. tag:value:true. There is a variety of search possibilities in Data Catalog, by the way. Please refer to the search syntax documentation for details.

Before closing this section, it is worth mentioning users can annotate Qlik assets with their own Tag Templates and Tags, leveraging Data Catalog to reinforce corporate Data Governance policies, for example.

WRAPPING UP

I’ve described challenges and decisions my team faced in the past few months when working on a solution to integrate Qlik Sense and Google Data Catalog. We have developed a sample connector that can be used to read metadata from Qlik Sense and store it in Data Catalog, making Qlik assets fully discoverable in Google Cloud.

The initial version handles 7 regular Qlik asset types, but there is always room for improvement. For example, we have investigated the possibility of parsing metadata from scripts and their underlying connections; this could result in elementary data lineage support and the connector would help to quickly answer questions such as “Which data sources are used by this App?” or “Which Qlik Apps would break if we drop this database table?”. Also, I believe it’s worth recapping the possibility of handling the compositions between the Master Items and Sheets: the required metadata seems to be available in the Qlik Engine API, but further researches are still needed. We didn’t have time to work on these topics so far…

By the way, in case you have suggestions or find bugs while running the connector, please don’t hesitate to file a feature request or issue. Feedback is always welcome!

Finally, I didn’t explore possible deployment solutions in this blog post but periodic executions can be easily scheduled given the connector ships with a command-line interface.

I hope you enjoyed the reading. Many thanks!