DataStax’s Snowflake Sink Connector for Apache Pulsar

Author: Andrey Yegorov

DataStax · 3 min read · Dec 6, 2021


The Snowflake Sink Connector for Apache Pulsar provides a straightforward way to combine the messaging capabilities of Apache Pulsar with the powerful analytics of Snowflake by automatically and continuously ingesting messages from a Pulsar topic into a Snowflake table.

The connector wraps the Snowflake Kafka Connector, created by Snowflake, with Apache Pulsar’s Kafka Connect Adaptor (KCA) Sink. The Snowflake Kafka Connector provides access to code built and optimized by people familiar with the nuances of Snowflake. This is the first production-ready connector built with the KCA Sink, and it requires DataStax Luna Streaming 2.8 or Apache Pulsar 2.9.

To get started, download the Snowflake Sink Connector from GitHub.

Let’s try the connector

Now let’s look at how to get the connector up and running in a local test environment. I’ll assume macOS, but all the commands are easy to adapt for other operating systems.

1. Set up Snowflake

Whether you already have a Snowflake instance or have just signed up for a trial, you can access it using a URL that looks like “tenant_id.snowflakecomputing.com”. Remember this URL; you’ll need it for the configuration later.

First, let’s create public and private keys for use with the connector:
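
A minimal sketch using OpenSSL; the file names are examples, and an unencrypted PKCS#8 key is fine for a local test (use an encrypted key in production):

    # Generate a 2048-bit RSA private key in unencrypted PKCS#8 format
    openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt

    # Derive the matching public key
    openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub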

Get the public key:
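
Stripping the PEM header and footer makes the key body easy to paste into the Snowflake worksheet in the next step:

    # Print the public key body on one line, without the PEM header/footer
    grep -v "PUBLIC KEY" rsa_key.pub | tr -d '\n'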

Get the private key:
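
The same trick yields the private key body, which goes into the connector configuration later (treat this value as a secret):

    # Print the private key body on one line, without the PEM header/footer
    grep -v "PRIVATE KEY" rsa_key.p8 | tr -d '\n'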

Next, we’ll create a new database by following the steps from the Snowflake Documentation:

  1. Sign in to the Snowflake instance you want to use for the test under the admin account.
  2. Navigate to the Worksheets.
  3. Execute the following commands as a user with sysadmin and securityadmin privileges. Don’t forget to check “[x] All Queries” at the top of the worksheet:
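
Here’s a sketch of those worksheet commands. The database, schema, and role names (pulsar_db, pulsar_schema, pulsar_connector_role_1) are placeholders I’ve picked for this walkthrough; pulsar_connector_user_1 is the user we log in as later:

    -- Create a role for the connector
    use role securityadmin;
    create role if not exists pulsar_connector_role_1;

    -- Create the database and schema the connector will write to
    use role sysadmin;
    create database if not exists pulsar_db;
    create schema if not exists pulsar_db.pulsar_schema;

    -- The connector creates tables, stages, and pipes as it ingests data
    grant usage on database pulsar_db to role pulsar_connector_role_1;
    grant usage on schema pulsar_db.pulsar_schema to role pulsar_connector_role_1;
    grant create table on schema pulsar_db.pulsar_schema to role pulsar_connector_role_1;
    grant create stage on schema pulsar_db.pulsar_schema to role pulsar_connector_role_1;
    grant create pipe on schema pulsar_db.pulsar_schema to role pulsar_connector_role_1;

    -- Create the connector's user and attach the public key generated earlier
    use role securityadmin;
    create user if not exists pulsar_connector_user_1 password = 'CHANGE_ME' default_role = pulsar_connector_role_1;
    alter user pulsar_connector_user_1 set rsa_public_key = 'PASTE_PUBLIC_KEY_FROM_STEPS_ABOVE';
    grant role pulsar_connector_role_1 to user pulsar_connector_user_1;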

Note: You’ll need to replace ‘PASTE_PUBLIC_KEY_FROM_STEPS_ABOVE’ in the alter user statement with the public key you generated above, and choose a real password for the user.

2. Prepare config for the connector
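
Here’s a sketch of what the sink config can look like, saved as snowflake-sink-config.yaml for this walkthrough. The topic, database, schema, and user names match the examples in the other steps, and everything under kafkaConnectorConfigProperties is a standard Snowflake Kafka Connector property:

    tenant: "public"
    namespace: "default"
    name: "snowflake-sink"
    inputs:
      - "persistent://public/default/snowflake-topic"
    archive: "pulsar-snowflake-connector.nar"
    parallelism: 1
    processingGuarantees: "EFFECTIVELY_ONCE"
    configs:
      # Kafka-facing topic name; sanitizeTopicName maps Pulsar topic names
      # to names the Kafka connector accepts
      topic: "snowflake-topic"
      sanitizeTopicName: "true"
      offsetStorageTopic: "snowflake-sink-offsets"
      # Small batches and a short linger time so test results show up quickly
      batchSize: 10
      lingerTimeMs: 1000
      kafkaConnectorSinkClass: "com.snowflake.kafka.connector.SnowflakeSinkConnector"
      kafkaConnectorConfigProperties:
        name: "snowflake-sink"
        topics: "snowflake-topic"
        buffer.count.records: "10"
        buffer.flush.time: "10"
        buffer.size.bytes: "100"
        snowflake.url.name: "tenant_id.snowflakecomputing.com:443"
        snowflake.user.name: "pulsar_connector_user_1"
        snowflake.private.key: "PASTE_PRIVATE_KEY_FROM_STEPS_ABOVE"
        snowflake.database.name: "pulsar_db"
        snowflake.schema.name: "pulsar_schema"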

Note that batch sizes, buffers, linger time, etc. are reduced to see the outcome of the test faster. You may want to tune these for production use.

Note: You’ll need to set tenant_id in snowflake.url.name and paste the private key from the steps above into snowflake.private.key for your environment.

The Snowflake Sink Connector supports all of the Snowflake Kafka Connector’s configuration properties.

3. Prepare Apache Pulsar’s topic

Start Apache Pulsar locally:
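
Assuming a local binary distribution of Apache Pulsar 2.8+ (or Luna Streaming), from the distribution directory:

    # Start a standalone Pulsar broker on localhost:6650
    bin/pulsar standalone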

Set retention for the namespace:
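
For example, infinite retention on the default namespace so messages stay available for the sink (the values here are for a local test only):

    # -1 means unlimited size and unlimited time
    bin/pulsar-admin namespaces set-retention public/default --size -1 --time -1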

Set topic schema:
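
A sketch, assuming a simple record with id and name fields; save the schema definition to schema.json and upload it to the topic used in this walkthrough:

    # schema.json -- a JSON schema wrapping an Avro-style record definition
    {
      "type": "JSON",
      "schema": "{\"type\":\"record\",\"name\":\"Test\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}",
      "properties": {}
    }

    # Upload the schema to the topic
    bin/pulsar-admin schemas upload persistent://public/default/snowflake-topic -f schema.json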

4. Download and start the connector
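
Grab the connector NAR from the GitHub releases page; the URL below is a placeholder, so check the releases page for the actual version and artifact name. Then create the sink with the config from step 2:

    # Download the connector NAR (substitute the current release version)
    curl -L -o pulsar-snowflake-connector.nar \
      https://github.com/datastax/snowflake-connector/releases/download/<version>/pulsar-snowflake-connector-<version>.nar

    # Create the sink from the config prepared earlier
    bin/pulsar-admin sinks create --sink-config-file snowflake-sink-config.yaml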

5. Send messages to Pulsar
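
For example, a shell loop that produces 200 JSON messages matching the id/name fields assumed in the schema step:

    # Produce 200 JSON messages that match the topic's schema
    for i in $(seq 1 200); do
      bin/pulsar-client produce persistent://public/default/snowflake-topic \
        -m "{\"id\": $i, \"name\": \"test $i\"}"
    done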

Now the topic contains 200 messages matching the JSON schema we set earlier. You can preview them with a consume command like the one below:
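
Here the subscription name is arbitrary, and --subscription-position Earliest starts reading from the beginning of the topic:

    # Read the 200 messages back from the start of the topic
    bin/pulsar-client consume persistent://public/default/snowflake-topic \
      -s preview-sub -n 200 --subscription-position Earliest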

6. Check the messages in Snowflake

Log in to your Snowflake instance as pulsar_connector_user_1, with the password you configured during the Snowflake setup step earlier.

A new table should appear under the configured DB and schema:

Figure 1: New table in Snowflake.

Preview the data to see that the messages have been ingested by Snowflake:
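
A sketch, assuming the names used earlier. The Snowflake Kafka Connector derives the table name from the topic name and stores each message in RECORD_METADATA and RECORD_CONTENT variant columns, so adjust the table name to whatever appears in your schema:

    -- Preview a few ingested rows
    select record_metadata, record_content
    from pulsar_db.pulsar_schema.snowflake_topic
    limit 10;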

Figure 2: Data ingested into Snowflake.

It may take a couple of minutes for all the messages to get there.

As you can see, setting up and running the connector is easy. If you have any questions about Pulsar or the Snowflake connector, contact DataStax for a free consultation and our team will get you right on track.

Follow DataStax on Medium for exclusive posts on all things Pulsar, Cassandra, streaming, Kubernetes, and more. To join a buzzing community of developers from around the world and stay in the data loop, follow DataStaxDevs on Twitter.

Resources

  1. GitHub: Snowflake Connector
  2. GitHub: Snowflake Kafka Connect Connector
  3. Simplify migrating from Kafka to Pulsar with Kafka Connect Support
  4. DataStax Luna Streaming — Apache Pulsar Distribution
  5. Snowflake Connector for Kafka — Snowflake Documentation
  6. Installing and Configuring the Kafka Connector — Snowflake Documentation
  7. Four Reasons Why Apache Pulsar is Essential to the Modern Data Stack
