Ingesting custom event sources with Snowplow

How to use Snowplow to ingest data from an unsupported source, such as Auth0’s log streaming service.

Jonathan Merlevede
Nov 24, 2020 · 10 min read
It can be tricky to get Snowplow to ingest unsupported data. Photo by Tamas Pap on Unsplash

Snowplow comes with a great set of tracking SDKs which construct and fire events for ingestion by your Snowplow backend. But what if you want to ingest events generated by an external service, where you do not control how data is formatted?

This story covers how to ingest data from Auth0’s log streaming service, an event source not natively supported by Snowplow. The first part discusses how Snowplow event ingestion works, and possible ways to extend its capabilities. The second part details how to use a forwarding adapter to ingest data from any source.

Ingesting supported event types

If you know about Snowplow, you’re probably familiar with its “processing pipeline” and the figure below:

Snowplow processing pipeline (taken from https://github.com/snowplow/snowplow)

As illustrated by this figure, Snowplow supports receiving events from its own trackers (1a on the figure). Snowplow also allows the collection of events from supported third-party software, where the Snowplow collector is configured as a webhook (1b). How does this process work?

On a high level, processing steps 1 through 4 of the chain work as follows:

  1. An event is generated, either by a Snowplow tracker or in a third-party application. The tracker or third-party application submits an HTTP request to the collector component containing event data.
  2. Without looking at the contents of the request, the collector component passes the request on to the enrichment component. The request can be a GET or a POST request; the collector does not care.
  3. The enrichment component maps the path part of the request URL onto an adapter, which transforms the raw request data into one or more raw (Snowplow) events. The enrichment component then enriches these raw events, and passes the enriched events on to the storage component. If you’re interested in a somewhat more in-depth overview of the enrichment component, check out my other story.
  4. The storage component stores the enriched events in your database or data warehouse, such as Redshift or BigQuery.

For the purposes of this story, we’re interested in the third step, and particularly in its first part: the adapter’s mapping of the raw request onto raw events.

An adapter transforms the raw HTTP request into one or multiple raw Snowplow events.

Snowplow Trackers send data to the following URL:

https://${COLLECTOR_HOST}/com.snowplowanalytics.snowplow/tp2

Requests to this URL have to adhere to the Snowplow tracker protocol (version 2), and are transformed into raw events by the TP2Adapter. The TP2Adapter accepts both GET and POST requests. A single POST request can describe multiple raw events. We’ll get back to this later in this story.
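To make the GET variant concrete, here is a minimal sketch of how a single event could be encoded as query parameters on the tp2 path. The host name, the helper name buildGetUrl and the tracker version label are placeholders of my own; e, p, tv and url are parameter names from the tracker protocol.

```typescript
// Hypothetical collector host; replace with your own.
const COLLECTOR_HOST = "collector.example.com";

// Build a tracker protocol GET URL describing a single event.
// Every event property travels as a query-string key/value pair.
function buildGetUrl(host: string, params: { [key: string]: string }): string {
  const query = Object.keys(params)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(params[key]))
    .join("&");
  return "https://" + host + "/com.snowplowanalytics.snowplow/tp2?" + query;
}

const pageViewUrl = buildGetUrl(COLLECTOR_HOST, {
  e: "pv", // event type: page view
  p: "web", // platform
  tv: "example-0.1.0", // tracker version (free-form label, assumed here)
  url: "https://example.com/", // page URL
});
```

A POST request carries the same key/value pairs, but as JSON objects in the request body, which is what allows it to describe several events at once.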

All in all, the TP2Adapter is simple, as it turns out that the properties of Snowplow “raw events” are pretty much the same as the properties specified in the tracking protocol. This is good to know, as the format of raw events is not really documented.

Support for other event types is implemented similarly. For example, if you want to ingest Mailchimp data, you configure the following callback URL in Mailchimp:

https://${COLLECTOR_HOST}/com.mailchimp/v1

Snowplow then uses MailchimpAdapter to transform the raw request to raw events. Snowplow knows which adapter to use because the mapping of paths onto adapter classes is encoded in AdapterRegistry. The registry and its mapping are hardcoded in the Snowplow common enrich source code (see below).

Excerpt from the AdapterRegistry class definition
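The actual registry is written in Scala and is not reproduced here. As a loose TypeScript analogy only, the idea of a hardcoded path-to-adapter lookup can be sketched as follows. The names adapterRegistry and findAdapter are mine, and I list only the two adapters mentioned in this story; the real registry contains more entries.

```typescript
// Keys are "<vendor>/<version>", taken from the path of the collector request.
// This is an illustration of the concept, not Snowplow's actual code.
const adapterRegistry = new Map<string, string>([
  ["com.snowplowanalytics.snowplow/tp2", "TP2Adapter"],
  ["com.mailchimp/v1", "MailchimpAdapter"],
]);

// Look up the adapter for a request path such as "/com.mailchimp/v1".
// Returns undefined when no adapter is registered for the path.
function findAdapter(path: string): string | undefined {
  const key = path.replace(/^\//, ""); // drop the leading slash
  return adapterRegistry.get(key);
}
```

A path without a registered adapter, such as a hypothetical "/com.auth0/v1", yields no adapter, so there is nothing that can turn the request into raw events.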

Ingesting unsupported event types

We now know how Snowplow ingests data from its trackers as well as supported third-party software. But what about events from unsupported third-party software? Say you want to ingest Auth0 log events, which are not a natively supported event type.

One obvious way to extend Snowplow’s functionality is to change Snowplow’s source code. To add a custom adapter, you need to add a line in the AdapterRegistry class and implement a single adapter class in Scala. Certainly not impossible.

Going this route unfortunately has a significant downside: you’ll need to build the enrichment component from source, and verify that your changes are not conflicting with upstream changes every time you update the enrichment component. Although you can avoid this by creating a pull request and getting your changes merged into Snowplow’s codebase, this takes time, and requires that the event that you are trying to ingest is of public interest.

If you are e.g. Auth0 or another vendor that wants its events to be supported by Snowplow, this is the way to go. If you are a third party, I suggest thinking twice before going down this route.

Contributing a Snowplow adapter is not likely to be on the shortest path to ingestion of custom events.

The remote HTTP adapter, introduced in Snowplow R114, was specifically designed to ingest unsupported event types. The remote HTTP adapter is a configurable adapter allowing you to adapt events by developing an HTTP endpoint.

To use the remote adapter, you have to write a configuration file mapping (collector) request paths onto remote HTTP endpoints. The remote adapter then registers itself in AdapterRegistry as the handler for all the configured request paths (you can see this in the AdapterRegistry.scala excerpt I included above). When the remote HTTP adapter is asked to process a request, it ships it off to the remote HTTP endpoint, which converts it into raw events.

Unfortunately, the remote adapter has some significant downsides:

  • It is not currently supported by Beam Enrich, and there is no ongoing effort to support it. If you’re running the GCP stack with Beam Enrich, this means you simply can’t use it.
  • Documentation on both the remote adapter and the format of raw events is lacking.
  • The Snowplow enrichment component becomes dependent on the external HTTP endpoint; if the HTTP endpoint goes down, ingestion of other event types might also be affected.
  • The remote HTTP adapter might be deprecated at some point in the future, because its usage is somewhat controversial (because of the reasons above?) and, I am assuming, because it is not very popular. It is certainly not part of the “core” Snowplow stack.

The remote adapter has some significant downsides; I cannot recommend its usage.

Instead of configuring Snowplow to contact a remote HTTP service to help it process events that it can’t process on its own, we can point the external service (in my case, Auth0) to an endpoint of our own making. This endpoint transforms the request into a request that adheres to Snowplow’s tracker protocol, and forwards it to the Snowplow collector. That’s it!

  • No special Snowplow configuration required.
  • You’re only using the most basic Snowplow features and are programming against its most central API; this approach is unlikely to break.
  • If the forwarding service goes down, your other events remain unaffected.

Implementation

So, we’ve figured out that we want to use a forwarding HTTP adapter to ingest our custom events. Let’s go into more detail on how to do this in general, and how to ingest Auth0 log events in particular.

When ingesting a new event source, you will generally start by defining a new self-describing (also known as unstructured) event type. An example Auth0 log stream event payload looks as follows:

So, we see that

  • Events are sent in micro-batches; a single request generally contains an array of events. How many events are batched together likely depends on the load on your Auth0 tenant, but that’s not important here.
  • Every event has a log_id, date, type, description, client_id, client_name, ip, user_agent and user_id property.
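In TypeScript, the shape described above could be captured roughly as follows. This is a sketch under assumptions: I treat every property as a string and the events as flat objects, which may not match the exact nesting of a real Auth0 payload, and parseBatch is a hypothetical helper that checks only that the body is an array.

```typescript
// Assumed shape of one Auth0 log event, based on the property list above.
interface Auth0LogEvent {
  log_id: string;
  date: string; // assumed to be an ISO 8601 timestamp
  type: string;
  description: string;
  client_id: string;
  client_name: string;
  ip: string;
  user_agent: string;
  user_id: string;
}

// Parse the body of one webhook request into a micro-batch of events.
// Only the outer array is checked; individual fields are not validated.
function parseBatch(body: string): Auth0LogEvent[] {
  const parsed: unknown = JSON.parse(body);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of log events");
  }
  return parsed as Auth0LogEvent[];
}
```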

Since some of these properties (e.g. client_id) do not map onto Snowplow atomic properties, we define a custom self-describing event type. We could use structured events, but unstructured events are generally the better option as they’re much more expressive.

To that end, add a schema similar to the following to your schema registry:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for an auth0 log entry",
  "self": {
    "vendor": "com.acme",
    "name": "auth0_log",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "log_id": {
      "type": "string",
      "description": "Log id",
      "maxLength": 100
    },
    "type": {
      "type": "string",
      "description": "Log type",
      "maxLength": 100
    },
    "description": {
      "type": "string",
      "description": "Log description",
      "maxLength": 10000
    },
    "client_id": {
      "type": "string",
      "description": "Client id",
      "maxLength": 32
    },
    "client_name": {
      "type": "string",
      "description": "Client name",
      "maxLength": 100
    }
  }
}
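Events refer to this schema through an Iglu URI built from the values in its "self" block. A minimal sketch of that derivation, with helper names (igluUri, selfDescribing) that are my own:

```typescript
// The "self" block of an Iglu schema.
interface SchemaSelf {
  vendor: string;
  name: string;
  format: string;
  version: string;
}

// Build the Iglu URI that identifies a schema: iglu:vendor/name/format/version.
function igluUri(self: SchemaSelf): string {
  return "iglu:" + self.vendor + "/" + self.name + "/" + self.format + "/" + self.version;
}

const AUTH0_LOG_SCHEMA_URI = igluUri({
  vendor: "com.acme",
  name: "auth0_log",
  format: "jsonschema",
  version: "1-0-0",
});

// A self-describing JSON pairs a schema reference with the data it describes.
function selfDescribing<T>(schema: string, data: T): { schema: string; data: T } {
  return { schema: schema, data: data };
}
```

With the schema above, AUTH0_LOG_SCHEMA_URI is "iglu:com.acme/auth0_log/jsonschema/1-0-0".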

Some remarks:

  • Note that certain log entry properties are part of the atomic event model (ip, user_agent and date). We can include these properties as part of the self-describing event properties, but I suggest not doing so. Firstly, when encoding ip, user_agent and date as self-describing event properties, the resulting events will have either conflicting or duplicate values for these properties, as the atomic event properties will still be attached to every event. Secondly, using the native fields means the values go through the Snowplow enrichment process, so Snowplow will apply the configured enrichments (e.g. pseudonymise the IP address, derive geo-information from the IP, …).
  • I do not go into detail on how to set up a registry or how to configure enrich to look for schema definitions in your registry here. Feel free to comment or contact me if you are having trouble setting this up.
  • Although valid values for Auth0 log event type codes are defined (see here), I chose not to include possible values as part of the schema, as I do not want to have to update the schema when the list of valid values changes.
  • I defined maxLength to get rid of igluctl’s linter warnings, but did not pay a lot of attention to the value of this property. My target datastore is BigQuery, for which this length property is meaningless.

The default Snowplow tracking endpoint accepts both GET requests and POST requests. A GET request describes a single event, and includes data as query parameters. A POST request can describe multiple events, and includes data as JSON in the request’s body. Since our forwarding HTTP adapter receives lists of events as inputs, it itself should generate POST requests so that it generates only a single outgoing request for every incoming request.

The format of a valid Tracker Protocol 2 request payload is itself defined by an Iglu JSON schema, payload_data, which is available in Iglu Central here. The meaning of all of the payload properties is defined in the Snowplow Tracker Protocol.

I opted to implement my HTTP service using TypeScript and Node.js and to deploy it as a Cloud Function to GCP, but any language or deployment model will do the trick. The HTTP service is easy to write; its job is simple (~source of main function):

  1. Accept the request
  2. Convert the Auth0 log events to the Tracker Protocol
  3. Envelop all the log data in a payload_data envelope
  4. Send the adapted events to the Snowplow collector
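The steps above can be sketched as a small function that glues them together. This is not my actual Cloud Function source, just an outline: the convert, envelop and send steps are passed in as functions so that the flow is visible on its own, and the names handleBatch and ForwarderSteps are hypothetical.

```typescript
// One event in tracker protocol form: a flat map of string values.
type TrackerEvent = { [key: string]: string };

// The three pluggable steps of the forwarding adapter.
interface ForwarderSteps<E> {
  convert: (event: E) => TrackerEvent; // step 2
  envelop: (events: TrackerEvent[]) => unknown; // step 3
  send: (payload: unknown) => Promise<void>; // step 4
}

// Step 1 (accepting and parsing the request) happens before this call;
// `events` is the already-parsed micro-batch.
async function handleBatch<E>(events: E[], steps: ForwarderSteps<E>): Promise<number> {
  const converted = events.map((event) => steps.convert(event));
  const payload = steps.envelop(converted);
  await steps.send(payload); // one outgoing request per incoming request
  return converted.length;
}
```

In a deployment, send would issue the POST request to the collector, and the HTTP framework or Cloud Function runtime would take care of step 1.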

Instead of constructing HTTP requests to the collector from scratch, you may also consider using a Snowplow tracker/tracking SDK, such as the Node.js tracker, and have it construct the events for you. The choice is up to you!

Converting Auth0 log events to Tracker Protocol events can be done as follows:

As discussed, we set the user ID (uid), user agent (ua), IP address (ip) and event time (ttm) as Snowplow event properties, and not as properties of the custom unstructured event. Setting these properties explicitly prevents Snowplow from filling them in from the forwarding request itself, for example by using the IP address of the forwarding service instead of that of the end user. We also set the event ID (eid), as it is recommended to set a unique event ID in the client. Lastly, note that the value of ue_pr (“unstructured event properties”) is not an object, but a string containing serialized JSON!
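A minimal sketch of such a conversion is shown below. It assumes flat Auth0 log events with string properties and the com.acme schema from earlier; the function name, the "srv" platform value and the tracker version label are my own choices for illustration. The event ID is passed in by the caller, who could generate it with e.g. crypto.randomUUID() in Node.js.

```typescript
// Assumed shape of one Auth0 log event (all properties treated as strings).
interface Auth0LogEvent {
  log_id: string;
  date: string; // assumed ISO 8601 timestamp
  type: string;
  description: string;
  client_id: string;
  client_name: string;
  ip: string;
  user_agent: string;
  user_id: string;
}

const UNSTRUCT_EVENT_SCHEMA =
  "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0";
const AUTH0_LOG_SCHEMA = "iglu:com.acme/auth0_log/jsonschema/1-0-0";

// Convert one log event into a flat map of tracker protocol parameters.
function toTrackerEvent(log: Auth0LogEvent, eid: string): { [key: string]: string } {
  const unstructEvent = {
    schema: UNSTRUCT_EVENT_SCHEMA,
    data: {
      schema: AUTH0_LOG_SCHEMA,
      // Only the properties that are not covered by atomic fields.
      data: {
        log_id: log.log_id,
        type: log.type,
        description: log.description,
        client_id: log.client_id,
        client_name: log.client_name,
      },
    },
  };
  return {
    e: "ue", // event type: self-describing (unstructured) event
    p: "srv", // platform: server-side (assumed choice)
    tv: "auth0-forwarder-0.1.0", // tracker version label (hypothetical)
    eid: eid, // unique event ID, expected to be a UUID
    uid: log.user_id,
    ua: log.user_agent,
    ip: log.ip,
    ttm: String(Date.parse(log.date)), // true timestamp in ms since epoch
    ue_pr: JSON.stringify(unstructEvent), // a string, not an object
  };
}
```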

Enveloping the events in a payload_data envelope can be done as follows:
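A minimal sketch of the envelope and of the resulting POST request follows. The schema version 1-0-4 is an assumption on my part, so check Iglu Central for the version you want to target; the helper names and the collector host passed in are hypothetical as well.

```typescript
// Assumed payload_data version; verify against Iglu Central.
const PAYLOAD_DATA_SCHEMA =
  "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4";

type TrackerEventParams = { [key: string]: string };

interface PayloadData {
  schema: string;
  data: TrackerEventParams[];
}

// Wrap a list of tracker protocol events in a payload_data envelope.
function envelopEvents(events: TrackerEventParams[]): PayloadData {
  return { schema: PAYLOAD_DATA_SCHEMA, data: events };
}

// Describe the POST request to the collector, without sending it.
function buildCollectorRequest(host: string, events: TrackerEventParams[]) {
  return {
    url: "https://" + host + "/com.snowplowanalytics.snowplow/tp2",
    method: "POST",
    headers: { "Content-Type": "application/json; charset=utf-8" },
    body: JSON.stringify(envelopEvents(events)),
  };
}
```

The returned description can be handed to whichever HTTP client you prefer; all events of one incoming Auth0 request end up in a single outgoing request.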

All that remains is to deploy the service and point the Auth0 streaming logs webhook to our newly deployed service. Do not point it to the Snowplow collector, as it won’t know how to ingest the data!

Concluding remarks

Ingesting custom events in Snowplow is not as easy as it maybe should be. The mechanism used to ingest events from custom services is not very extensible. To ingest custom event types in the same way as the natively supported types, you have to dive into (Scala) source code and compile your own enrichment component. Although there is a built-in mechanism for implementing “remote HTTP adapters”, I’d suggest avoiding it. If you’re using Beam Enrich, using remote adapters is completely unsupported. Although it looks like Snowplow Analytics might increase the variety of events that can be ingested without much configuration, for example by supporting events specified by the CloudEvents spec, the list of natively supported event types is currently limited.

Ingesting custom events in Snowplow is not as easy as it maybe should be.

To ingest unsupported event types, using a “forwarding HTTP adapter” as outlined in this story is a decent, generic alternative to using remote adapters. Creating a forwarding HTTP adapter is relatively easy, even though implementing one does require some familiarity with the Snowplow Tracker Protocol spec. Nevertheless, writing an Iglu schema and an HTTP forwarding adapter is likely easier than writing a service to ingest data into your storage solution directly.

Think about why you want to use Snowplow for ingestion, and weigh it against alternative approaches.

You should definitely consider whether ingestion through Snowplow is something you want. An alternative for me would have been to write the Cloud Function to perform a streaming insert directly into a BigQuery table.

An important factor when deliberating is probably whether you already use Snowplow for other purposes. Snowplow has an advantage if your warehouse does not support streaming inserts, if you use Snowplow enrichments such as IP lookups or IP anonymization, or if you appreciate Snowplow’s data consistency checks and semantic data model versioning (SchemaVer). An obvious downside compared to more direct ingestion methods is the number of moving parts, and the dependency on Snowplow.

I work at Data Minded, an independent Belgian data engineering consultancy. Contact us if you’d be interested in working with us!
