Enrichment and batch processing in Snowplow

A close look at Snowplow’s enrichment component as well as the deprecation of its batch pipeline

Jonathan Merlevede
datamindedbe
6 min read · Mar 18, 2020


An image of a snowplow plowing snow. Source: Jeroen Kransen on Flickr

When I was writing my Medium story “An introduction to Snowplow”, I decided not to discuss two interesting topics because they were out of scope for an introduction:

  • A thorough explanation of the enrichment procedure, out of scope on account of its complexity. How does enrichment, the most interesting part of Snowplow’s processing pipeline, work?
  • Batch processing implementations, out of scope on account of their deprecation. How did they work, and why were they deprecated?

This story takes a closer look at these two omissions.

I assume that you are familiar with the basic Snowplow processing pipeline. If you are not, check out my introduction to Snowplow first.

The enrichment procedure

After data is ingested by the collector component, Snowplow’s enrichment component validates it, verifying that it is specified in a protocol that it understands. It then extracts event properties and enriches the events. Together, these operations make up Snowplow’s enrichment process. At the end of the process, events adhere to the Snowplow canonical event model, and it is these canonical events that are passed on to the storage component.

Enrichment in action. Source: enriched from Wikipedia

There’s a lot to unpack here. The enrichment component is without a doubt the most complex part of the Snowplow processing pipeline, and is Snowplow’s only core component that you can customize with your own code. Let’s dive in!

Adapting HTTP requests to raw events

Enrichment starts with an adapter mapping raw HTTP GET and POST requests to one or more “raw events”. Raw events have properties and are associated with an HTTP context (the set of headers, referrer, and so on). Valid raw event properties are exactly the properties defined in the Snowplow tracker protocol.

  • As raw event properties map onto tracker protocol parameters (version 2, tp2), the adapter for the tracker protocol is trivial. For example, a GET request with the query-string parameter p=web is adapted into a raw event containing a property p with the value web. Each raw event corresponds to a single event, though, so when batched POST requests are used, the tp2 adapter explodes a single request to the collector into multiple raw events (see the sketch below).
  • Adapters for other protocols essentially map those protocols onto the tracker protocol.
  • The collector URL specifies what adapter to use. For example, events specified in the tracker protocol (version 2), created by vendor com.snowplowanalytics.snowplow, should be sent to https://collector.acme/com.snowplowanalytics.snowplow/tp2.
  • You can implement your own adapter as a microservice using the API Request enrichment. Unless you are doing this, there is no reason to know about raw events. Sorry!

Adapters map HTTP requests onto raw events, whose properties correspond to the tracker protocol’s parameters
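To make this concrete, here is a minimal sketch in TypeScript. This is not Snowplow’s actual adapter code (which lives in the Scala enrich project); the parameter names come from the tracker protocol, while the function names and the shape of the POST body wrapper are illustrative assumptions.

```typescript
// Illustrative sketch only -- not Snowplow's actual adapter implementation.
type RawEvent = Record<string, string>;

// A tp2 GET request carries one event in its query string,
// e.g. https://collector.acme/com.snowplowanalytics.snowplow/tp2?e=pv&p=web
function adaptTp2Get(url: string): RawEvent[] {
  const rawEvent: RawEvent = {};
  new URL(url).searchParams.forEach((value, key) => {
    rawEvent[key] = value; // raw event properties are exactly the tp2 parameters
  });
  return [rawEvent]; // a GET always yields a single raw event
}

// A batched tp2 POST body carries several events in a "data" array;
// the adapter explodes it into one raw event per entry.
function adaptTp2Post(body: { data: RawEvent[] }): RawEvent[] {
  return body.data.map((event) => ({ ...event }));
}
```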

Hard-coded enrichments

After adaptation, the enrichment component transforms raw events into canonical events through a sequence of hard-coded enrichments and configurable enrichments. Hard-coded enrichments cannot be turned off. They map the “raw” properties and the HTTP context onto “canonical” event properties.

  • Usually the mapping is one-to-one. For example, a raw event with property p valued web becomes a canonical event with the property platform valued web. Raw properties can also map onto multiple canonical properties: the raw property res, for instance, maps onto the canonical properties dvce_screenwidth and dvce_screenheight. See the documentation of the tracker protocol and the documentation of hard-coded enrichments for all the mappings (both kinds of mapping appear in the sketch below).
  • There are also properties extracted from the HTTP context, resulting in canonical properties such as user_ipaddress or page_referrer. These can usually be overridden from the tracker side; page_referrer, for example, is overridden by setting the raw property refr.

Raw events are transformed into canonical events by a sequence of hard-coded and configurable enrichments
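The sketch below shows the flavour of these hard-coded mappings. The raw and canonical property names (p, res, refr, platform, dvce_screenwidth, dvce_screenheight, user_ipaddress, page_referrer) are taken from the Snowplow documentation; the code itself is illustrative and not the actual Scala enrichment.

```typescript
// Illustrative sketch only -- not the actual Scala enrichment code.
interface HttpContext {
  userIp?: string;
  referer?: string;
}

interface CanonicalEvent {
  platform?: string;
  dvce_screenwidth?: number;
  dvce_screenheight?: number;
  user_ipaddress?: string;
  page_referrer?: string;
}

function applyHardCodedEnrichments(
  raw: Record<string, string>,
  http: HttpContext
): CanonicalEvent {
  const canonical: CanonicalEvent = {};

  // One-to-one mapping: raw "p" becomes canonical "platform".
  if (raw.p) canonical.platform = raw.p;

  // One-to-many mapping: raw "res" (e.g. "1920x1080") becomes two canonical fields.
  if (raw.res) {
    const [width, height] = raw.res.split("x");
    canonical.dvce_screenwidth = Number(width);
    canonical.dvce_screenheight = Number(height);
  }

  // Properties extracted from the HTTP context...
  if (http.userIp) canonical.user_ipaddress = http.userIp;
  // ...which raw properties such as "refr" can override.
  canonical.page_referrer = raw.refr ?? http.referer;

  return canonical;
}
```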

Configurable enrichments

Configurable enrichments are run after the hard-coded ones and have to be enabled through configuration. Some of these extract more information from the raw events, whereas others simply enrich already extracted properties.

  • A typical configurable enrichment is the IP lookups enrichment, which infers basic location data from the IP address that sent the message. Another interesting configurable enrichment is the PII pseudonymisation enrichment, which helps with GDPR compliance by obfuscating certain event properties. Find the full list of configurable enrichments here.
  • To enrich your data in your own custom way, use the generic JavaScript enrichment or the API Request enrichment (a sketch of a JavaScript enrichment follows this list). Plugging your own enrichments directly into Snowplow’s Scala codebase is not recommended.
  • The only way to find out in which order configurable enrichments are applied is to look at the source code. Snowplow-provided enrichments run in an order that makes everything work. Custom enrichments are run last, allowing you to override or further enrich already enriched properties.
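To give a feel for what a custom enrichment looks like, here is a sketch of the shape of a JavaScript enrichment. In practice the function is written as plain JavaScript and supplied (base64-encoded) through the enrichment’s JSON configuration; the getter/setter names, the returned context format and the schema URI below are assumptions of mine, so check Snowplow’s JavaScript enrichment documentation before relying on any of it.

```typescript
// Sketch of the *shape* of a custom JavaScript enrichment -- all names below
// are assumptions, not the confirmed API.

// Hypothetical minimal view of the enriched event passed to the function.
interface EnrichedEventLike {
  getApp_id(): string | null;
  setApp_id(value: string): void;
}

function process(event: EnrichedEventLike) {
  // Read an already-enriched canonical property...
  const appId = event.getApp_id();

  // ...and override it with a derived value.
  if (appId === "acme-shop") {
    event.setApp_id("acme-shop-web");
  }

  // Optionally return extra context entities as self-describing JSON
  // (the schema URI below is hypothetical).
  return [
    {
      schema: "iglu:com.acme/enrichment_flag/jsonschema/1-0-0",
      data: { flagged: true },
    },
  ];
}
```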

What about validation?

No, I didn’t forget about validation! Although important and central to Snowplow’s value-add, validation is not actually a step but instead happens throughout the enrichment process. When the incoming data does not conform to a supported protocol or when any enrichment step fails, the enrichment process stops. The event is then discarded or written to a bad rows topic.

Validation happens throughout the enrichment process
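As a mental model only (none of this is Snowplow code), you can think of the enrich job as folding an event through a list of steps, any one of which can reject it:

```typescript
// Mental model only -- not Snowplow's actual implementation.
// Every adaptation/enrichment step doubles as a validation step: the first
// failure short-circuits the pipeline and the event becomes a "bad row".

type Event = Record<string, unknown>;
type Step = (event: Event) => Event; // throws when the event is invalid

type EnrichResult =
  | { status: "enriched"; event: Event }
  | { status: "bad"; errors: string[] };

function runEnrichment(raw: Event, steps: Step[]): EnrichResult {
  let event = raw;
  for (const step of steps) {
    try {
      event = step(event);
    } catch (error) {
      // In the real pipeline this would end up on the bad rows topic.
      return { status: "bad", errors: [String(error)] };
    }
  }
  return { status: "enriched", event };
}
```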

Batch processing in Snowplow

The Snowplow batch processing pipeline has recently been deprecated, and there is really not much reason to get to know Snowplow’s batch components anymore. Still, these components are only just now being phased out, and there are a number of references to them in Snowplow’s documentation. If those references confuse you, or if you are simply curious, keep on reading.

Snowplow’s batch processing components are deprecated

How do batch components work?

Let’s start by looking at some batch components and how they work. The most popular Snowplow collector might still be the Cloudfront web collector. This is simply a single 1x1 pixel GIF file hosted on Amazon Cloudfront with web access logging enabled, resulting in HTTP access logs on blob storage. The EmrEtlRunner then orchestrates a job on Amazon EMR that parses these logs, enriches them and writes the enriched events to a configurable destination such as S3. The reading, parsing and enriching of these logs, which contain “batches” of events, does not happen in real time, but rather in “batch” fashion.

The Cloudfront collector does not use the Thrift format that defines the interface between the collector and enrichment components in the streaming pipeline. The EmrEtlRunner also does not fit neatly into either of the “component categories” enrichment or storage, as the job it runs both enriches and stores events.
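Conceptually, the batch path reuses the same tracker-protocol mapping as the streaming path, just starting from logged query strings instead of live requests. A rough sketch of that idea (the real Cloudfront access log is a tab-separated format with many more fields, and this is not the actual parser):

```typescript
// Rough sketch of the batch idea only -- not the real log format or parser.
// The Cloudfront collector just records pixel requests in access logs; the
// batch job later replays each logged query string through the same
// tracker-protocol mapping used by the streaming enrichment.

function rawEventFromLoggedQueryString(queryString: string): Record<string, string> {
  const rawEvent: Record<string, string> = {};
  new URLSearchParams(queryString).forEach((value, key) => {
    rawEvent[key] = value;
  });
  return rawEvent;
}

// e.g. rawEventFromLoggedQueryString("e=pv&p=web&page=Home");
```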

Ancient history

When it was conceived, Snowplow did all processing in batches. In 2015, some streaming processing components were added, and Snowplow evolved to have a so-called Lambda architecture. Consistent with Lambda architecture nomenclature, Snowplow’s processing components were partitioned into a speed layer and a batch layer. This terminology is still present in the Snowplow documentation, and some Snowplow batch components still exist and remain popular.

The one and only Cloudfront web collector

Recent history

Over the last couple of years, Snowplow’s development has focused on the speed layer. Batch processing had its advantages, notably simplicity and cost, but those advantages have been losing importance, whereas Snowplow’s ability to process events in real time is one of the things that sets it apart from its competition.

Summary

There has been an RFC to deprecate batch-layer processing since July 2019. The batch collectors (the Clojure collector and the Cloudfront collector) and the Spark (batch) enrichment component were officially deprecated recently, in February 2020. The EmrEtlRunner will be deprecated soon. You could say that Snowplow is evolving towards a Kappa architecture. In any case, batch components should clearly be avoided when setting up a new processing pipeline.

Any mistakes or something you feel I missed? Let me know in the comments!

I work at Data Minded, an independent Belgian data analytics consultancy, and this is where I document and share my learnings from deploying Snowplow at Publiq.

Publiq is a non-profit organisation managing a database of activities in Flanders, Belgium. As part of an exciting project that will make Publiq more data-driven, we are investigating the use of clickstream data to improve the quality of recommendations.
