DataStax’s Snowflake Sink Connector for Apache Pulsar

The Snowflake Sink Connector for Apache Pulsar provides a straightforward way to combine the messaging capabilities of Apache Pulsar with the powerful analytics of Snowflake by automatically and continuously ingesting messages from a Pulsar topic into a Snowflake table.

The connector wraps the Snowflake Kafka Connector, created by Snowflake, with Apache Pulsar’s Kafka Connect Adaptor (KCA) Sink. This gives Pulsar users access to code built and optimized by people familiar with the nuances of Snowflake. It is the first production-ready connector built with the KCA Sink and requires DataStax Luna Streaming 2.8 or Apache Pulsar 2.9.

To get started, download the Snowflake Sink Connector from GitHub.

Let’s try the connector

Now let’s look at how to get the connector up and running in a local test environment. I’ll assume you’re using macOS, but all the commands are easy to adapt to other operating systems.

1. Setup Snowflake

Whether you already have a Snowflake instance or just signed up for a trial, you can access it using a URL that looks like “tenant_id.snowflakecomputing.com”. Remember this URL; you’ll need it for the configuration later.

First, let’s create public and private keys for use with the connector:
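The key-generation commands aren’t reproduced here; a minimal sketch with openssl might look like the following. The file names (rsa_key.p8, rsa_key.pub) are my choice, and I’m assuming an unencrypted PKCS#8 private key, the simplest option Snowflake’s key-pair authentication accepts:

```shell
# Generate an unencrypted 2048-bit RSA private key in PKCS#8 PEM format,
# then derive the matching public key from it:
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -nocrypt -out rsa_key.p8
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
```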

Get the public key:
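For example (regenerating the pair first if you skipped it; Snowflake wants the key as a single line with no PEM header or footer):

```shell
# Generate the key pair if it doesn't exist yet (file name is my choice):
[ -f rsa_key.p8 ] || openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -nocrypt -out rsa_key.p8
# Print the public key body as one line, ready to paste into the SQL below:
openssl rsa -in rsa_key.p8 -pubout | grep -v -- "-----" | tr -d '\n'
```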

Get the private key:
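Similarly, the private key body goes into the connector configuration as a single line (again regenerating the pair first if needed):

```shell
# Generate the key pair if it doesn't exist yet (file name is my choice):
[ -f rsa_key.p8 ] || openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -nocrypt -out rsa_key.p8
# Print the private key body without the PEM header/footer, on one line:
grep -v -- "-----" rsa_key.p8 | tr -d '\n'
```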

Next, we’ll create a new database by following steps from the Snowflake Documentation:

  1. Sign in to the Snowflake instance you want to use for the test with an admin account.
  2. Navigate to the Worksheets.
  3. Execute the following commands as a user with sysadmin and securityadmin privileges. Don’t forget to check “[x] All Queries” at the top of the worksheet:
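The original SQL isn’t shown here; a sketch following the Snowflake Kafka connector’s installation docs might look like this. The database, schema, and role names are illustrative choices, while pulsar_connector_user_1 is the user this walkthrough logs in as later:

```sql
-- Create a role for the connector and grant it to sysadmin:
use role securityadmin;
create role if not exists pulsar_role;
grant role pulsar_role to role sysadmin;

-- Create the target database and schema:
use role sysadmin;
create database if not exists pulsar_db;
use database pulsar_db;
create schema if not exists pulsar_schema;

-- Grant the privileges the connector needs on the schema:
grant usage on database pulsar_db to role pulsar_role;
grant usage on schema pulsar_schema to role pulsar_role;
grant create table on schema pulsar_schema to role pulsar_role;
grant create stage on schema pulsar_schema to role pulsar_role;
grant create pipe on schema pulsar_schema to role pulsar_role;

-- Create the connector user with key-pair auth and a login password:
use role securityadmin;
create user pulsar_connector_user_1
  password = 'CHANGE_ME'
  default_role = pulsar_role
  rsa_public_key = 'PASTE_PUBLIC_KEY_FROM_STEPS_ABOVE';
grant role pulsar_role to user pulsar_connector_user_1;
```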

Note: You’ll need to replace ‘PASTE_PUBLIC_KEY_FROM_STEPS_ABOVE’ in the snippet with the public key generated above.

2. Prepare config for the connector
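The config file isn’t reproduced here; a sketch of what a KCA-based sink config could look like follows. The file layout is Pulsar’s standard sink config, the sink and topic names are my choices, and the kafkaConnectorConfigProperties follow the Snowflake Kafka connector documentation:

```yaml
tenant: "public"
namespace: "default"
name: "snowflake-sink"
inputs:
  - "persistent://public/default/snowflake-topic"
archive: "./pulsar-snowflake-connector.nar"
parallelism: 1
configs:
  # Small batch/linger values so test results show up quickly:
  lingerTimeMs: "10"
  batchSize: "10"
  topic: "snowflake-topic"
  offsetStorageTopic: "snowflake-sink-offsets"
  kafkaConnectorSinkClass: "com.snowflake.kafka.connector.SnowflakeSinkConnector"
  kafkaConnectorConfigProperties:
    name: "snowflakesink"
    connector.class: "com.snowflake.kafka.connector.SnowflakeSinkConnector"
    tasks.max: "1"
    # Small buffers, again for fast feedback in a test:
    buffer.count.records: "10"
    buffer.flush.time: "10"
    buffer.size.bytes: "100"
    snowflake.url.name: "tenant_id.snowflakecomputing.com:443"
    snowflake.user.name: "pulsar_connector_user_1"
    snowflake.private.key: "PASTE_PRIVATE_KEY_FROM_STEPS_ABOVE"
    snowflake.database.name: "pulsar_db"
    snowflake.schema.name: "pulsar_schema"
```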

Note that batch sizes, buffers, linger time, etc. are reduced to see the outcome of the test faster. You may want to tune these for production use.

Note: You’ll need to replace tenant_id and the key placeholder in the config with the values for your environment.

The Snowflake Sink Connector supports all of the Snowflake Kafka Connector’s configuration properties.

3. Prepare Apache Pulsar’s topic

Start Apache Pulsar locally:
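Assuming you’ve downloaded and unpacked a Pulsar binary distribution, from its directory:

```shell
# Run a standalone broker in the foreground:
bin/pulsar standalone
```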

Set retention for the namespace:
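For example, infinite retention on the default namespace so messages survive until the sink reads them (for a quick local test only; pick sensible limits in production):

```shell
bin/pulsar-admin namespaces set-retention public/default --size -1 --time -1
```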

Set topic schema:
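A sketch of uploading a JSON schema for the topic; the topic name matches the sink config above and the field names (id, name) are illustrative:

```shell
# Write a schema definition file (Pulsar's JSON schema type wraps an
# Avro-style record definition as an escaped string):
cat > schema.json <<'EOF'
{
  "type": "JSON",
  "schema": "{\"type\":\"record\",\"name\":\"Test\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}",
  "properties": {}
}
EOF
bin/pulsar-admin schemas upload persistent://public/default/snowflake-topic \
  --filename schema.json
```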

4. Download and start the connector
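After downloading the connector .nar from the GitHub releases page, you can create the sink from the config prepared in step 2 (the archive and config file names here are my choices):

```shell
bin/pulsar-admin sinks create \
  --archive ./pulsar-snowflake-connector.nar \
  --sink-config-file ./snowflake-sink.yaml
```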

5. Send messages to Pulsar
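A hypothetical producer loop that sends 200 JSON messages matching the schema above:

```shell
for i in $(seq 1 200); do
  bin/pulsar-client produce persistent://public/default/snowflake-topic \
    --messages "{\"id\": $i, \"name\": \"name-$i\"}"
done
```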

Now the topic contains 200 messages with a given JSON schema. You can preview the messages in the topic with the following command:
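For example, reading the first ten messages from the beginning of the topic with a throwaway subscription:

```shell
bin/pulsar-client consume persistent://public/default/snowflake-topic \
  -s "preview" -p Earliest -n 10
```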

6. Check the messages in Snowflake

Log in to your instance of Snowflake as pulsar_connector_user_1 with the password configured in the Snowflake setup step earlier.

A new table should appear under the configured DB and schema:

Figure 1: New table in Snowflake.

Preview the data to see that the messages have been ingested by Snowflake:
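For example, a query along these lines (the database and schema names come from your setup, and the connector derives the table name from the topic, so adjust accordingly):

```sql
-- RECORD_METADATA and RECORD_CONTENT are the two columns the
-- Snowflake Kafka connector writes:
select record_metadata, record_content
from pulsar_db.pulsar_schema.snowflake_topic
limit 10;
```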

Figure 2: Data ingested into Snowflake.

It may take a couple of minutes for all the messages to get there.

As you can see, setting up and running the connector is easy. If you have any questions about Pulsar or the Snowflake connector, contact DataStax for a free consultation and our team will get you right on track.

Follow DataStax on Medium for exclusive posts on all things Pulsar, Cassandra, streaming, Kubernetes, and more. To join a buzzing community of developers from around the world and stay in the data loop, follow DataStaxDevs on Twitter.

Resources

  1. GitHub: Snowflake Connector
  2. GitHub: Snowflake Kafka Connect Connector
  3. Simplify migrating from Kafka to Pulsar with Kafka Connect Support
  4. DataStax Luna Streaming — Apache Pulsar Distribution
  5. Snowflake Connector for Kafka — Snowflake Documentation
  6. Installing and Configuring the Kafka Connector — Snowflake Documentation
  7. Four Reasons Why Apache Pulsar is Essential to the Modern Data Stack
