How to start playing with Snowplow?

Jiri Stepan
Etnetera Activate
May 28, 2019 · 4 min read

There are many ways to start with Snowplow, but most of them have a fairly steep learning curve. There are simply too many options and combinations of Snowplow components and cloud services available.

The first important decision is whether to run Snowplow on Amazon Web Services (AWS) or Google Cloud Platform (GCP). This choice has many implications, and when you just want to explore the platform’s capabilities, you may have no idea which cloud would suit your particular needs in the future. For real use cases, selecting the right cloud platform is crucial, because Snowplow uses it for many tasks, not just as a virtual machine provider.

In the GCP environment, Snowplow uses Google Pub/Sub to pass messages through the whole pipeline and Google Dataflow to enrich the data and load it into the final database. At the end of the process, the data typically ends up in BigQuery. GCP support is still quite new in the Snowplow ecosystem, which means that many components are not very mature yet.

This differs from the AWS ecosystem, which has been supported since the foundation of the project. There, S3 is used for data collection, Elastic MapReduce for the ETL process, and the data is finally stored in a Redshift database.

The final format of the data also differs. In Redshift you get many related tables, and you have to deal with DDL when adding new schemas. In BigQuery you get one huge table, and adding a new schema is automated by the BigQuery mutator.

The same applies to the individual building blocks of the pipeline.

On top of that, there is configuration and other deployment work that requires a lot of time in the console. You may lose sight of what’s important if you just want to grasp the basic concepts. Remember that you want to start playing with your own data schemas, enrichments and modelling, not to do the work of a Linux admin.

How do you avoid this overhead? Use Snowplow Mini!

Fortunately, there is Snowplow Mini.

It’s a virtual machine image that contains everything you need to start playing with Snowplow:

  • Snowplow data processing chain — Collector and Stream Enricher
  • Elasticsearch for storing and indexing the processed events
  • Kibana web interface for data exploration: in real time you can see good requests, bad requests, enriched data, etc., and you can use that for creating fancy visualisations (see the sketch after this list)
  • PostgreSQL for SQL data modelling
  • Integrated Iglu server to host your own data schemas
  • Simple web interface to control everything
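
The Kibana and Elasticsearch parts are the quickest way to check that events actually flow through the pipeline. Here is a minimal sketch of peeking at the indexed events from Python; the host name, the /elasticsearch proxy path and the good/bad index names are assumptions based on my own setup, so check the addresses of your own instance.

```python
import requests

# Hypothetical address of your Snowplow Mini instance; replace it with your own.
MINI_HOST = "http://my-snowplow-mini.example.com"

# In my setup, enriched events were indexed under "good" and failed events
# under "bad"; Elasticsearch was reachable under the /elasticsearch path.
resp = requests.get(
    f"{MINI_HOST}/elasticsearch/good/_search",
    params={"q": "*", "size": 3},
    timeout=10,
)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    event = hit["_source"]
    print(event.get("app_id"), event.get("event"), event.get("collector_tstamp"))
```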

The installation is fast: it’s just one virtual machine based on a ready-made image. You can deploy it to Google Cloud, AWS or any other provider (e.g. DigitalOcean). With a decent virtualisation setup, you could probably run the image on a local machine as well.

It is also pretty cheap. I used an n1-standard-1 machine on GCP, which was fast enough, and it cost roughly 20 USD/month for my own experiments.

It would probably be a terrible idea to use Snowplow Mini for production data, but it is ideal for development and testing environments. In just five minutes you can start sending requests and experimenting with data visualisation and modelling. And you can quickly wipe everything if something goes wrong.
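
To illustrate the "sending requests" part, here is a minimal sketch that fires a single page-view hit at the Mini collector using the Snowplow tracker protocol. The host name is a made-up placeholder; in a real project you would use one of the official trackers (JavaScript, Python, Android, ...) instead of raw HTTP.

```python
import uuid
import requests

# Hypothetical address of your Snowplow Mini instance; replace it with your own.
COLLECTOR = "http://my-snowplow-mini.example.com"

# A minimal page-view event in the Snowplow tracker protocol, sent to the
# collector's GET pixel endpoint (/i).
params = {
    "e": "pv",                           # event type: page view
    "url": "https://example.com/hello",  # page URL
    "page": "Hello page",                # page title
    "aid": "playground",                 # application id
    "p": "web",                          # platform
    "tv": "manual-0.1",                  # tracker name/version label
    "eid": str(uuid.uuid4()),            # unique event id
}

resp = requests.get(f"{COLLECTOR}/i", params=params, timeout=10)
print(resp.status_code)  # 200 means the collector accepted the hit
```

A few seconds later the enriched event should show up in Kibana (or in the Elasticsearch query from the earlier sketch).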

The integrated Iglu server allows you to develop your own data schemas and test them quickly. Wheee.
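
A custom schema is just a self-describing JSON document. The sketch below defines one for a made-up article_read event as a Python dict, so it can be dumped to JSON and uploaded to the Mini's Iglu server (for example with igluctl or the server's API); the vendor, event name and fields are all hypothetical.

```python
import json

# A hypothetical self-describing schema for a custom "article_read" event.
article_read_schema = {
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Fired when a visitor finishes reading an article",
    "self": {
        "vendor": "com.example",   # your organisation, in reversed-domain form
        "name": "article_read",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {
        "article_id": {"type": "string"},
        "seconds_spent": {"type": "integer", "minimum": 0},
    },
    "required": ["article_id"],
    "additionalProperties": False,
}

print(json.dumps(article_read_schema, indent=2))

# Once hosted on the Iglu server, trackers reference the schema by its URI:
#   iglu:com.example/article_read/jsonschema/1-0-0
```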

With this you can focus on data analytics, not on cloud service configuration.

--

Jiri Stepan
Etnetera Activate

I lead a team of great people at Etnetera. I am interested in e-business, travelling, sci-fi, theatre, triathlon, ...