Google Data Catalog and Qlik Sense metadata integration
Leverage Data Catalog to discover & annotate Qlik Sense assets
In this blog post, I’m going to share thoughts and knowledge that drove the development of a new sample connector for Google Data Catalog: the Qlik Sense connector.
This is the third open-source connector that allows customers to ingest metadata from BI Systems into Data Catalog: the ones for Looker and Tableau have been available for a while.
Disclaimer: Google and/or Google Cloud do not officially support any tool to connect Data Catalog to non-GCP systems at the time this article has been written down (January 2021). What you will find here is merely the result of my experience as a Data Catalog adopter.
GENERAL OVERVIEW
The connector consists of a Python package bundled with a command-line interface. Source code is available on GitHub. Instructions to install and run it are available there as well.
Long story short, it:
- Scrapes metadata from Qlik Sense by consuming their REST and Websocket APIs.
- Prepares the information to fit the catalog entity model.
- Ingests the prepared metadata into Data Catalog using its API.
These are the minimum requirements to run the connector:
- Qlik Sense: a login/password pair with reading access to the Qlik Sense Repository Service (QRS) and Qlik Engine JSON APIs on a given site.
- Data Catalog: a Service Account allowed to create Entry Groups, Entries, Tag Templates, and Tags (please refer to the documentation in case you’re not familiar with Data Catalog roles).
- connector: a Python 3.6+ or Docker environment.
QLIK ASSET TYPES
Let me briefly describe some assets usually available on a Qlik site. Properly understanding what they mean and how they relate to each other will help you keep up with the coming sections.
- Streams are top-level elements; they are collections of Apps — also known as Documents.
- Apps are containers for components used to load, process, and display data. Example components include Scripts executed to connect to data sources and extract data into the document, and Sheets.
- Streams and Apps have standard properties such as Name, Id, Owner, and Description, but users can leverage the so-called Custom Property Definitions to extend the standard metadata set according to their needs. When managing Custom Property Definitions, users give them a name a list of acceptable values.
- Custom Properties are the associations between Custom Property Definitions, Streams, or Apps. They materialize the user-extended metadata.
- Sheets are containers for objects which are charts of various types.
- The objects used to populate the Sheets can be built individually (hence used in a single Sheet) or may come from a library within the App. Those belonging to such a library are called Master Items and split into three types: Dimensions, Measures, and Visualizations.
DESIGN WORK
When it comes to systems integration, one of the initial steps consists of design work. We need to understand the concepts and entities from each system and how they fit the other end. The Qlik asset types were described in the previous section. I assume the reader has an elementary understanding of Google Data Catalog, but in case you don’t please take a look at this blog post.
The connector handles 7 Qlik asset types in its initial release:
- Custom Property Definitions
- Streams
- Apps (only the published ones)
- Master Items: Dimensions
- Master Items: Measures
- Master Items: Visualizations
- Sheets (only the published ones)
Data Catalog, on the other hand, uses three main classes to get its job done:
There will be a single catalog Entry for each Qlik asset. It will capture the most relevant metadata that describes the asset, such as a Display Name, Description, dates when it was created and last modified, the System used to manage it (entry.user_specified_system = 'qlik'
), and its Type: entry.user_specified_type = 'sheet'
or entry.user_specified_type = 'dimension'
, for instance.
Entries have standard fields that cannot capture all metadata from the assorted Qlik assets. Here Tags come to help us: they are flexible structures we can use to annotate catalog entries, thus extending the standard Entry fieldset. As an example, Apps have a boolean field indicating whether they were published or not, which is not covered by any Entry field. We can then attach a Tag with a published
field to an App-related Entry in order to preserve such metadata; a simple approach that avoids losing relevant information when integrating Qlik and Data Catalog. The same reasoning applies to a variety of fields readable through the Qlik APIs.
Tags are created from templates, which can be thought of as their defining elements. Tag Templates are used to specify the names and types of the Tag fields that will actually hold metadata. The below code snippet (simplified to improve readability) brings a Tag Template intended to capture Apps’ metadata:
The code that creates Tag Templates currently used by the connector can be found here. You will notice there is a template for each Qlik asset type.
Additional templates are created for each Custom Property Definition’s Choice Value as long as they are assigned to at least one Stream or App:
The rationale behind this decision comprises allowing the connector to synchronize all metadata scraped from each Custom Property, at the same time it enables Qlik assets to be easily found by their custom properties — using query strings such as
tag:property_name:"<PROPERTY-NAME>"
andtag:value:"<SOME-VALUE>"
.
Lastly, there are relationships between the Qlik assets — most notably: parent-child, association, and composition. Although Data Catalog so far does not support relationships between Entities, the connector addresses some of them through Tags. In summary:
- Three Tag fields usually link assets to their parents (aka upstream): the parent’s Id, Name, and a URL that allows the user to navigate to the related Entry using the Data Catalog UI. Taking the above Tag Template as an example: a Stream, which is an App’s parent, is identified by the
stream_id
,stream_name
, andstream_entry
fields. If you take a look at the source code or run the connector, you will notice the same pattern is used in other templates to mimic resembling parent-child relationships. - Custom Properties can be seen as associations between their definitions and the Streams or Apps they help to classify. Such associations are represented by specific Tags.
- Compositions — e.g. which Dimensions, Measures, and Visualizations are used to build a given Sheet — are not addressed by the connector yet. This deserves a note, though: our experience with the Qlik Engine API let us believe it might be feasible adding such support in future versions.
THE SAMPLE CONNECTOR
Now it is time to look at some technical traits. The source code is intended to be clear enough so anyone can understand it. However, we are talking about software, and some matters are still worth clarifying. You will find some Qlik-specific quirks in the coming paragraphs; I’ll try to add as much context and meaningful links as possible, but one might still need to refer to the Qlik Sense for Developers Help for a deeper reading.
Components architecture
Six main components are used to settle the solution, as shown in the next diagram. Green boxes represent Qlik resources; gray boxes, connector specific resources; the blue box, a Data Catalog resource.
- Qlik Sense Proxy Service (QPS) API: a REST API used for authentication purposes. It issues tickets that can be used as HTTP cookies or headers to authenticate subsequent requests to the QRS and Engine APIs.
- Qlik Sense Repository Service (QRS) API: the connector reads Streams, Apps, Custom Property Definitions, and Custom Properties metadata from this REST API.
- Qlik Engine JSON API: a WebSockets API that allows the connector to read metadata related to the Sheets and Master Items — Dimensions, Measures, and Visualizations.
- google-datacatalog-qlik-connector: works as the orchestrator of the scrape → prepare → ingest workflow mentioned at the beginning of the present blog post. It contains the code that allows users to start the connector using the command line interface. But, much more than this, code that handles authentication, metadata reading (scrape stage), and Qlik to Data Catalog entity translating (prepare stage). Metadata ingestion (ingest stage) is delegated to the
google-datacatalog-connectors-commons
component. - google-datacatalog-connectors-commons: code shared by all Data Catalog open-source connectors, comprising full metadata ingestion support and execution monitoring tools.
- Data Catalog API: allows users to create and manage Entries for custom data resource types, as well as Tag Templates and Tags.
User authentication
The connector currently uses NTLM to authenticate the Qlik user, which is the default authentication module for Qlik Sense Enterprise on Windows. NTLM also allows users to connect from outside a trusted zone. Thus, starting with NTLM was a design decision to reach a higher audience since the initial version.
For those looking for implementation details: the main authentication logic is handled by the
Authenticator
class, including thex-Qlik-Xrfkey
header setup.
There are other options, though. After reading a lot of stuff in community forums we’ve noticed most programmatic interactions with Qlik Sense Enterprise on Windows seem to happen under certificate-based authentication instead of login/pass. JWT-based authentication seems to be achievable as well, but either option requires further investigation and testing.
REST vs WebSocket APIs
The components numbered 2 and 3 in the above list let it clear we use two APIs to read metadata throughout the scrape stage.
We tried to get as much metadata as possible from the Qlik Sense Repository Service (QRS) API. It implements a REST protocol, which is widely known by developers and easy to consume.
Please refer to the
RepositoryServicesAPIHelper
class to better understand what I mean by “easy to consume”.
The Qlik Engine JSON API, on the other hand, implements a WebSockets protocol and requires asynchronous programming. We took a while to get up to speed with it in the early days but then came to a solution that showed to be productive: BaseEngineAPIHelper
is the base class of all Engine API interactions; its methods promote code reuse, leaving to the child classes the responsibility of handling only specific stuff, as you can see in the EngineAPISheetsHelper
, EngineAPIDimensionsHelper
, EngineAPIMeasuresHelper
, and EngineAPIVisualizationsHelper
classes.
RESULTS
In the next picture, you can see the results of performing a catalog search after running the connector. I’ve used the system=qlik type=stream
query to look for all Streams available in a Qlik site in our testing environment.
Similar queries might be used to look for other asset types:
system=qlik type=app
system=qlik type=custom_property_definition
system=qlik type=sheet
type=dimension
type=measure
type=visualization
Clicking one of the Entries will bring up its metadata:
Please notice the example Stream is labeled with the GCPUser
Custom Property in Qlik Sense; hence its catalog Entry is also annotated with a GCPUser
-related Tag in Data Catalog (collapsed in Image 6). We can even look for all assets labeled with that same Custom Property, something users cannot achieve using only the Qlik Sense features. Let me change the query string to tag:property_name:GCPUser
to show how it works:
I can search by Custom Property value, too — e.g. tag:value:true
. There is a variety of search possibilities in Data Catalog, by the way. Please refer to the search syntax documentation for details.
Before closing this section, it is worth mentioning users can annotate Qlik assets with their own Tag Templates and Tags, leveraging Data Catalog to reinforce corporate Data Governance policies, for example.
WRAPPING UP
I’ve described challenges and decisions my team faced in the past few months when working on a solution to integrate Qlik Sense and Google Data Catalog. We have developed a sample connector that can be used to read metadata from Qlik Sense and store it in Data Catalog, making Qlik assets fully discoverable in Google Cloud.
The initial version handles 7 regular Qlik asset types, but there is always room for improvement. For example, we have investigated the possibility of parsing metadata from scripts and their underlying connections; this could result in elementary data lineage support and the connector would help to quickly answer questions such as “Which data sources are used by this App?” or “Which Qlik Apps would break if we drop this database table?”. Also, I believe it’s worth recapping the possibility of handling the compositions between the Master Items and Sheets: the required metadata seems to be available in the Qlik Engine API, but further researches are still needed. We didn’t have time to work on these topics so far…
By the way, in case you have suggestions or find bugs while running the connector, please don’t hesitate to file a feature request or issue. Feedback is always welcome!
Finally, I didn’t explore possible deployment solutions in this blog post but periodic executions can be easily scheduled given the connector ships with a command-line interface.
I hope you enjoyed the reading. Many thanks!