How to start playing with Snowplow?

Jiri Stepan
Etnetera Activate
May 28, 2019 · 4 min read

There are many ways to start with Snowplow, but most of them have a fairly steep learning curve. There are simply too many options and combinations of Snowplow components and cloud services available.

The first important decision is whether to run Snowplow on Amazon Web Services (AWS) or Google Cloud Platform (GCP). This choice has many implications, and when you just want to explore the platform’s capabilities, you may have no idea which cloud would suit your particular needs in the future. For real use cases, selecting the right cloud platform is crucial, because Snowplow uses it for many tasks, not just as a virtual machine provider.

In the GCP environment, Snowplow uses Google Pub/Sub to pass messages through the whole pipeline and Google Dataflow to enrich the data and load it into the final database. At the end of the process, the data typically ends up in BigQuery. GCP support is still quite new in the Snowplow ecosystem, which means that many components are not very mature yet.

This differs from the AWS ecosystem, which has been supported since the foundation of the project. There, S3 is used for data collection, Elastic MapReduce for the ETL process, and the data is finally stored in a Redshift database.

The final format of the data also differs. In Redshift you get many related tables, and you have to deal with DDL when adding new schemas. In BigQuery you get one huge table, and adding a new schema is automated by the BigQuery mutator.

The same applies to the individual building blocks of the pipeline.

On top of that, there is configuration and other deployment work that requires a lot of time in the console. You may lose sight of what’s important if you just want to grasp the basic concepts. Remember that you want to start playing with your own data schemas, enrichments and modelling, not to do the work of a Linux admin.

How do you avoid this overhead? Use Snowplow Mini!

Fortunately, there is Snowplow Mini.

It’s a virtual machine image that contains everything you need to start playing with Snowplow:

  • Snowplow data processing chain — Collector and Stream Enricher
  • Elasticsearch for storing and indexing the processed events
  • Kibana web interface for data exploration: in real time you can see good requests, bad requests, enriched data, etc., and you can use that for creating fancy visualisations (see the sketch after this list)
  • PostgreSQL for SQL data modelling
  • Integrated Iglu server to host your own data schemas
  • Simple web interface to control everything
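
The Kibana and Elasticsearch parts are the quickest way to check that events actually flow through the pipeline. Here is a minimal sketch of peeking at the indexed events from Python; the host name, the /elasticsearch proxy path and the good/bad index names are assumptions based on my own setup, so check the addresses of your own instance.

```python
import requests

# Hypothetical address of your Snowplow Mini instance; replace it with your own.
MINI_HOST = "http://my-snowplow-mini.example.com"

# In my setup, enriched events were indexed under "good" and failed events
# under "bad"; Elasticsearch was reachable under the /elasticsearch path.
resp = requests.get(
    f"{MINI_HOST}/elasticsearch/good/_search",
    params={"q": "*", "size": 3},
    timeout=10,
)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    event = hit["_source"]
    print(event.get("app_id"), event.get("event"), event.get("collector_tstamp"))
```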

The installation is fast: it’s just one virtual machine based on a ready-made image. You can deploy it to Google Cloud, AWS or any other provider (e.g. DigitalOcean). With a decent virtualisation setup, you could probably run the image on a local machine as well.

It is also pretty cheap. I used an n1-standard-1 machine on GCP, which was fast enough, and it cost roughly 20 USD/month for my own experiments.

It would probably be a terrible idea to use Snowplow Mini for production data, but it is ideal for development and testing environments. In just five minutes you can start sending requests and experimenting with data visualisation and modelling. And you can quickly wipe everything if something goes wrong.
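
To illustrate the "sending requests" part, here is a minimal sketch that fires a single page-view hit at the Mini collector using the Snowplow tracker protocol. The host name is a made-up placeholder; in a real project you would use one of the official trackers (JavaScript, Python, Android, ...) instead of raw HTTP.

```python
import uuid
import requests

# Hypothetical address of your Snowplow Mini instance; replace it with your own.
COLLECTOR = "http://my-snowplow-mini.example.com"

# A minimal page-view event in the Snowplow tracker protocol, sent to the
# collector's GET pixel endpoint (/i).
params = {
    "e": "pv",                           # event type: page view
    "url": "https://example.com/hello",  # page URL
    "page": "Hello page",                # page title
    "aid": "playground",                 # application id
    "p": "web",                          # platform
    "tv": "manual-0.1",                  # tracker name/version label
    "eid": str(uuid.uuid4()),            # unique event id
}

resp = requests.get(f"{COLLECTOR}/i", params=params, timeout=10)
print(resp.status_code)  # 200 means the collector accepted the hit
```

A few seconds later the enriched event should show up in Kibana (or in the Elasticsearch query from the earlier sketch).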

The integrated Iglu server allows you to develop your own data schemas and test them quickly. Wheee.
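
A custom schema is just a self-describing JSON document. The sketch below defines one for a made-up article_read event as a Python dict, so it can be dumped to JSON and uploaded to the Mini's Iglu server (for example with igluctl or the server's API); the vendor, event name and fields are all hypothetical.

```python
import json

# A hypothetical self-describing schema for a custom "article_read" event.
article_read_schema = {
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Fired when a visitor finishes reading an article",
    "self": {
        "vendor": "com.example",   # your organisation, in reversed-domain form
        "name": "article_read",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {
        "article_id": {"type": "string"},
        "seconds_spent": {"type": "integer", "minimum": 0},
    },
    "required": ["article_id"],
    "additionalProperties": False,
}

print(json.dumps(article_read_schema, indent=2))

# Once hosted on the Iglu server, trackers reference the schema by its URI:
#   iglu:com.example/article_read/jsonschema/1-0-0
```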

With this you can focus on data analytics, not on cloud service configuration.

--

Jiri Stepan
Etnetera Activate

I lead a team of great people at Etnetera. I am interested in e-business, travelling, sci-fi, theatre, triathlon, ...