Connecting to Kafka from SQL Stream Builder on CDP Public Cloud

Ferenc Csaky
Cloudera
Oct 20, 2021

Data can come from plenty of places, or sources if you like. Well, in our case that source is strictly Kafka, although there are still options for where to host a Kafka installation.

In this post, I will show how to add your Kafka service as a data provider to our SQL Stream Builder (SSB) [1] service on CDP (Cloudera Data Platform) Public Cloud [2]. To be specific, I will go through the following Kafka service providers, in this order:

  • CDP Kafka [3]
  • Confluent Kafka [4]
  • Amazon MSK [5]

The SSB service is part of the Streaming Analytics template and was introduced to Public Cloud in Cloudera Runtime 7.2.11. As a starting point, you need a Streaming Analytics cluster in a CDP environment running at least runtime version 7.2.11. With that covered, let’s dive into the Kafka part.

Connect to CDP Kafka

The most obvious way is to hook in a Kafka service from CDP. For the sake of completeness, the Streaming Analytics cluster template also contains a built-in Kafka service, but that should be restricted to the SSB sampling logic; we do not recommend, nor support, it for production use. This means you need another Streams Messaging [3] cluster in the same environment and use that as a Kafka data source.

The Streams Messaging cluster broker configuration can be found on the Hardware tab of the CDP cluster page:

CDP Streams Messaging cluster broker FQDNs.

We will need the Core_broker FQDNs, which can be copied from this page. Since CDP Public Cloud is secured by design, we will use port 9093. In terms of authentication, there are two possibilities:

  • authenticate via Kerberos, or
  • authenticate via username and password.

The two methods only differ in configuration, so the first few steps will be the same:

  1. Navigate to the “Data Providers” page from the sidebar.
  2. Click the “Register Kafka Provider” link on the right side.

A modal named “Add Kafka Provider” pops up with the Kafka connection-related properties.

Auth via Kerberos

Clusters in the same environment are in the same Kerberos realm, so one can use the same Kerberos principal between them, which will be resolved in the background without any explicit configuration. After we give a name to the new provider and paste the brokers as <host>:<port> in a comma-separated list, we only need to pick SASL/SSL as the connection protocol and KERBEROS as the SASL mechanism. It is possible to pass a Kafka TrustStore, but in CDP Public Cloud that is auto-configured, so we can leave it empty.

Adding CDP Kafka data provider with Kerberos authentication.
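
For reference, SSB tables are backed by Flink SQL, so a Kafka table created against this provider corresponds roughly to a hand-written Flink SQL table using SASL_SSL with GSSAPI. The sketch below is only an illustration, not the exact DDL SSB generates; the broker hostnames, topic, table, and column names are hypothetical placeholders, and the Kerberos credentials themselves are resolved by the platform rather than spelled out in the DDL:

  -- Rough Flink SQL sketch of a Kafka table secured with SASL_SSL + Kerberos (GSSAPI).
  -- Broker hostnames, topic, table, and columns are hypothetical placeholders.
  CREATE TABLE orders_kerberos (
    order_id STRING,
    amount   DOUBLE,
    ts       TIMESTAMP(3)
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'broker1.example.site:9093,broker2.example.site:9093',
    'properties.security.protocol' = 'SASL_SSL',
    'properties.sasl.mechanism' = 'GSSAPI',
    'properties.sasl.kerberos.service.name' = 'kafka',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
  );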

After we set the connection configuration, nothing is left but to hit “Save”. And that is it: if the brokers and the provided credentials are correct, we can start using the newly added Kafka provider. The following steps show how to read a topic:

  1. Go back to “Console”.
  2. Click the “Tables” tab and hit “Add Table” -> “Apache Kafka”.
  3. Select the newly added provider as “Kafka Cluster”. (If everything is okay, the already existing topics will be listed under the “Topic Name” selector.)
  4. Add a table name.
  5. Click “Detect Schema” to fetch the message schema automatically.
  6. Click “Save Changes”.

Selecting a topic which belongs to the newly added Kafka data provider.

For more detailed information about the Kafka table-related configuration, please read through the documentation [6]. For how to run a SQL job that reads from that topic, see the corresponding part of the SSB documentation [7].
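
As a quick illustration, once the table is registered you can run a streaming query against it straight from the Console. Assuming the table was saved under the hypothetical name orders_kerberos from the earlier sketch:

  -- Continuously read the Kafka-backed table registered above
  -- (table and column names are hypothetical placeholders).
  SELECT order_id, amount, ts
  FROM orders_kerberos
  WHERE amount > 100;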

Interacting with a Kafka provider works the same way for every other Kafka service as well, so the next chapters will not include explicit examples of that.

Auth via Username/Password

It is also possible to connect to a cluster that runs in a different environment, as long as that environment is on the same network as the one hosting SSB. This requires username/password authentication, where the username is your CDP username and the password is your CDP workload password. CDP Public Cloud Kafka clusters use the PLAIN mechanism over SASL/SSL and authenticate via PAM by default.

Adding CDP Kafka data provider with User/Password authentication.
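
Under the hood this maps to the standard Kafka SASL/PLAIN client settings. Expressed as a hand-written Flink SQL sketch (brokers, topic, table, and credentials below are hypothetical placeholders; in SSB you only type them into the provider form):

  -- Rough sketch of a Kafka table using SASL_SSL with the PLAIN mechanism.
  -- Brokers, topic, columns, and credentials are hypothetical placeholders.
  CREATE TABLE orders_plain (
    order_id STRING,
    amount   DOUBLE
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'broker1.other-env.example.site:9093',
    'properties.security.protocol' = 'SASL_SSL',
    'properties.sasl.mechanism' = 'PLAIN',
    'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="my-cdp-user" password="my-workload-password";',
    'format' = 'json'
  );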

Confluent Kafka

Connecting to Confluent Kafka is pretty much the same as connecting to a CDP Kafka cluster via username/password. At the moment, Confluent Kafka supports only the PLAIN mechanism over SASL/SSL. After pasting the broker list and providing the credentials, it should look something like this:

Adding Confluent Kafka data provider.

Be aware that the given Confluent Kafka service has to be available over the public internet.
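
In practice, the only things that change compared to the previous username/password sketch are the broker endpoint and the credentials (for Confluent Cloud these are typically an API key and secret). A minimal sketch with placeholder values:

  -- Rough sketch for a Confluent-hosted topic; endpoint, topic, columns, and credentials are placeholders.
  CREATE TABLE orders_confluent (
    order_id STRING,
    amount   DOUBLE
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'pkc-xxxxx.us-east-1.aws.confluent.cloud:9092',
    'properties.security.protocol' = 'SASL_SSL',
    'properties.sasl.mechanism' = 'PLAIN',
    'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="my-api-key" password="my-api-secret";',
    'format' = 'json'
  );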

Amazon MSK

Amazon MSK offers several authentication methods, but for now SSB supports SASL/SSL in this case as well (and PLAINTEXT, of course). If the Streaming Analytics cluster is hosted on AWS, make sure you create the Kafka cluster on the same network (and subnets) so it will be accessible. While creating the cluster, select SASL/SCRAM under “Security settings”:

AWS Kafka cluster security settings.

Be aware that if SASL/SCRAM is enabled, unauthenticated access will not work by default, although it is possible to enable it. After the Kafka cluster is created, you also need to associate a secret with it. More information about that can be found in the AWS documentation [8].

From the SSB side, the only difference in adding an Amazon MSK provider is that it requires the SCRAM-SHA-512 mechanism:

Adding Amazon MSK data provider.
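
Translated into a hand-written Flink SQL sketch, the SCRAM setup would look roughly like this (broker endpoints, topic, table, and credentials are hypothetical placeholders; MSK typically exposes SASL/SCRAM on port 9096):

  -- Rough sketch of SCRAM-SHA-512 settings for an Amazon MSK-backed table.
  -- Broker endpoints, topic, columns, and credentials are hypothetical placeholders.
  CREATE TABLE events_msk (
    event_id STRING,
    payload  STRING
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'events',
    'properties.bootstrap.servers' = 'b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9096',
    'properties.security.protocol' = 'SASL_SSL',
    'properties.sasl.mechanism' = 'SCRAM-SHA-512',
    'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.scram.ScramLoginModule required username="msk-user" password="msk-secret";',
    'format' = 'json'
  );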

And that sums it up quite nicely. It is not really cumbersome to work with any major Kafka service, even outside Cloudera, right? Now it is up to you to define what is possible with this technology. SQL over streaming data rocks!

Reference

[1] SQL Stream Builder (SSB) documentation.
[2] CDP Public Cloud documentation.
[3] CDP Kafka (Streams Messaging) documentation.
[4] Confluent Kafka “Getting Started”.
[5] Amazon MSK “Getting Started”.
[6] Creating Kafka tables in SSB.
[7] Running simple SQL jobs via SSB Console.
[8] Create username/password secret in AWS.
