Interactive Analytics on Azure Event Hub with Trino

How Trino enables you to run interactive SQL queries on top of Azure Event Hub streaming data

Patrick Pichler
Creative Data
6 min read · Apr 21, 2022


Example Architecture (Image by Author)

Introduction

Modern distributed messaging systems such as Apache Kafka or Microsoft Azure Event Hubs have matured considerably over the last few years. They build upon distributed commit logs and allow you to publish and consume data in real time in a very efficient and distributed manner. This design pattern is mostly used for streaming data into downstream systems to enable low-latency follow-up analysis, or for building microservices and serverless architectures. Messaging systems thus act as a kind of middle layer that decouples data producers from data consumers.

While they serve those use cases very well, they aren’t really designed to handle direct queries or to run heavy computations such as joins or aggregations. Yet, as they become increasingly integrated into today's data architectures, the demand for such database-like functionality is rising. To facilitate this, some purpose-built technologies have already been introduced, such as ksqlDB for Apache Kafka. However, those tools are mostly tied to their own ecosystems and don’t allow you to query data across disparate data sources. This is where Trino comes into play. In this article, we will explore how Trino enables fetching live data from Azure Event Hubs and joining it with any other data source supported by Trino using simple SQL queries. Results can then be loaded into Power BI for further analysis or persisted to cold storage for archiving purposes. This is especially interesting since most cloud-based messaging services only support a maximum retention period of a few days before messages get dropped. Without further ado, let's get started.

1. Setup

  1. Docker Desktop on Windows or Docker Engine on Linux
  2. Cloning or downloading the Kafka connector example from the Trino Getting Started series
  3. Azure subscription including an Azure Event Hubs namespace at Standard tier or higher, which is required for the Kafka endpoint (enabled by default)
  4. Python runtime for data generation

2. Configuring Trino’s Kafka connector

The provided Kafka connector example includes everything required to locally load data into Kafka topics and query live data with Trino. However, since we take this tutorial just as a starting point to query our Azure Event Hub, we first need to make some configuration adjustments to get everything working.

First of all, we need to make the Azure Event Hub known to Trino, which is done in a catalog properties file, etc/catalog/kafka.properties. You can take the following code snippet and make your adjustments:

connector.name=kafka
kafka.nodes=EVENTHUBNAMESPACE.servicebus.windows.net:9093
kafka.table-names=EVENTHUBNAMES

kafka.hide-internal-columns=false
kafka.config.resources=PATHCONFIG

EVENTHUBNAMESPACE: The name of the created Azure Event Hub namespace. Example: trinoazureeventhub

EVENTHUBNAMES: The names of the Azure Event Hubs you want to query; multiple names can be listed, separated by commas. If you specify a non-existent Azure Event Hub, it is created automatically thanks to Kafka’s auto.create.topics.enable configuration, which is what we rely on in this example. The prefix defines the schema under which the Azure Event Hub appears in Trino, while the catalog/database is derived from the config file name, in our case kafka (kafka.properties). Each Azure Event Hub is represented as a single table in Trino. Example: azure.trinoazureeventhub

PATHCONFIG: Path to another Kafka configuration file, which we need to create manually to provide further Kafka properties. This is required for SASL authentication. Example: etc/catalog/kafka-configuration.properties

You can copy the following code snippet, add it as the content of the file you just created and referenced, and make your adjustments.

security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="$ConnectionString" \
  password="EVENTHUBSCONNECTIONSTRING";

EVENTHUBSCONNECTIONSTRING: A connection string from the Azure Event Hub namespace which includes the name of the shared access policy and its key value, among other information. Example: Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=XXXXXXXXXXXXXXX

3. Configuring Trino’s Kafka connector description file

As you might know, data in messaging systems is basically unstructured and self-contained. If you haven't configured any schema registry, there is also no metadata describing the format of the messages. Nevertheless, Trino’s Kafka connector is already able to query data in its raw form, holding the actual payload in a column named "_message" alongside other Kafka-specific columns. However, we want the messages decoded into separate columns, which can be achieved in two ways: we can either create a view and parse the message on the fly using the json_extract_scalar() function, or we can tell Trino upfront how to parse the message data by providing a description file.

In this example, we will later feed our Azure Event Hub with messages in JSON format. We now tell Trino the structure of those messages so that it can automatically transform the data into a table format. The description file needs to be located in the etc/kafka folder and must end with .json. One file is needed per table, and it is recommended, though not required, that the file name matches the table name. For our use case, we will create a file named azure.trinoazureeventhub.json with the following content:

{
  "tableName": "trinoazureeventhub",
  "schemaName": "azure",
  "topicName": "azure.trinoazureeventhub",
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "deviceId",
        "mapping": "deviceId",
        "type": "VARCHAR"
      },
      {
        "name": "temperature",
        "mapping": "temperature",
        "type": "DOUBLE"
      },
      {
        "name": "pressure",
        "mapping": "pressure",
        "type": "DOUBLE"
      },
      {
        "name": "ts",
        "mapping": "ts",
        "type": "TIMESTAMP",
        "dataFormat": "custom-date-time",
        "formatHint": "yyyy-MM-dd HH:mm:ss.SSS"
      },
      {
        "name": "source",
        "mapping": "source",
        "type": "VARCHAR"
      }
    ]
  }
}

4. Starting up Trino and its CLI

Now we are ready to finally run the docker-compose command as outlined in the tutorial. The Kafka-specific services will fail; don't worry, this is expected due to our configuration changes. Only the Trino coordinator is required for us to go on, since this is what actually runs Trino, including its CLI.

Once we are connected to it, the CLI allows us to run interactive SQL queries in Trino. If you have just created the Azure Event Hub namespace without an Event Hub itself, running the first query will automatically create one for you, though, as mentioned above, it won't return any data yet. For the given example, the query looks like this:

SELECT * FROM kafka.azure.trinoazureeventhub;

5. Publishing data to Azure Event Hub

To finally test our setup, we will use the following Python code snippet to generate fictional telemetry data and publish it in batches to the Azure Event Hub we just created. It only requires the azure-eventhub module to be installed, as well as adjusting the variables connection_str and eventhub_name before starting it.

Azure Event Hub Data Generator Snippet (by Author)
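A minimal sketch of such a generator, assuming the azure-eventhub v5 SDK and the message schema from the description file above, could look roughly like this. The device names, value ranges, and send interval are illustrative placeholders, not fixed parts of the setup:

import json
import random
import time
from datetime import datetime

from azure.eventhub import EventHubProducerClient, EventData

# Adjust these two variables to your environment (see section 2 for the connection string)
connection_str = "EVENTHUBSCONNECTIONSTRING"
eventhub_name = "azure.trinoazureeventhub"

producer = EventHubProducerClient.from_connection_string(
    conn_str=connection_str, eventhub_name=eventhub_name
)

try:
    while True:
        # Build one batch of fictional telemetry messages
        batch = producer.create_batch()
        for device in ["device-1", "device-2", "device-3"]:
            message = {
                "deviceId": device,
                "temperature": round(random.uniform(15.0, 35.0), 2),
                "pressure": round(random.uniform(0.8, 1.2), 4),
                # Timestamp format must match the formatHint in the description file
                "ts": datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3],
                "source": "simulator",
            }
            batch.add(EventData(json.dumps(message)))
        producer.send_batch(batch)
        time.sleep(1)
finally:
    producer.close()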

6. Querying and analyzing data

If everything is set up correctly and data is available in the Azure Event Hub, we should finally be able to receive the data in the correct format by triggering the query again via the CLI.

Example SQL Query Azure Event Hub Trino (by Author)
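Beyond the CLI, the same table can of course be reached from any tool that talks to Trino. As a quick illustrative sketch, not part of the original setup, the following uses the trino Python client (pip install trino), assuming the coordinator is exposed on localhost:8080 as in the getting-started compose file, and aggregates the live telemetry per device:

import trino

# Connect to the local Trino coordinator (assumed reachable on localhost:8080;
# any user name works for an unauthenticated local setup)
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="trino",
    catalog="kafka",
    schema="azure",
)
cur = conn.cursor()

# Average temperature and pressure per device over the live Event Hub data
cur.execute("""
    SELECT deviceId,
           count(*)         AS events,
           avg(temperature) AS avg_temperature,
           avg(pressure)    AS avg_pressure
    FROM trinoazureeventhub
    GROUP BY deviceId
""")

for row in cur.fetchall():
    print(row)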

Since the Azure Event Hub is now available in Trino like any other table, we could also load data further into Power BI. For this example, I have used my self-developed and open-sourced custom connector for Trino available here.

Conclusion

We have gone through a small working example proving how Trino can be leveraged to easily join multiple streaming data sources with static data coming from somewhere else in a single SQL statement. This can open up lots of new opportunities, besides giving you insight into offsets and, in general, what's currently going on in your Azure Event Hub. This pattern should also work for any other messaging platform that provides an endpoint compatible with the Apache Kafka APIs.

