Snowplow, Redshift, and Metabase — if you can’t measure it, you can’t improve it

Nino Ulsamer
StashAway Engineering
Jun 18, 2019

If you are building a business that relies on making customers happy (unfortunately not a mantra that every company follows…), you will soon notice that in order to keep up with customer demands, you have to work constantly on improving your product. Even if your customers don’t openly complain, your stealth competitor’s product launch might be just around the corner. Therefore, soon after launch many startups ask themselves how to build an efficient and affordable business intelligence (BI) setup that allows them to understand what their customers are doing, and to feed reports to management and other departments that rely on data to drive their decision making — in our case, for example, our product and marketing teams.

In this blog post I will guide you through our decision-making process and the set of tools we use to power BI at StashAway. Of course, this is just a snapshot of our status quo and may or may not fulfil all the requirements you have.

Buy or build?

The first question we asked ourselves was whether to buy a full-fledged solution from one of the big guys (and pay the big bucks), or to build something on our own. In general we are not big fans of reinventing the wheel and building basic infrastructure that isn’t our core competency. On the other hand, (customer) data is one of the most important assets a product company owns, and decisions around data structures and the paths that data takes as it moves through the company tend to have a long-lasting impact on the systems you use today and (more importantly) the systems you will be able to use in the future. We don’t have a crystal ball, so we don’t know which systems we will be using a few years from now, but we will certainly have more options on the table if we control the raw data ourselves.

Long story short: in my opinion it is not a good idea to integrate with a third-party solution and rely entirely on its tracking setup, unless you have an easy way to extract that data from its system. Even then you need to be careful, because the inflexibility of the system may have forced you into workarounds in how you structure your data that may haunt you in the future.

The Stack

A simplified BI setup broadly consists of three parts: first you have to capture events (tracking), then you have to store them somewhere (storage), and finally you want to run some sort of visualization or reporting on top of them (analytics).

There are usually two different types of data sources flowing into your BI systems: user interactions (e.g. a pageview on your website or a click in your app) and events (e.g. an order placement or a money deposit). It makes sense to classify data this way because the source systems and the ingestion paths are usually different for them. User interactions are tracked with a “real-time” tracking-pixel type of setup (a URL that is called as the user interacts with the system), while events can also be processed batch-style in the background (for example with a cronjob that walks through all new deposits and marks them for ingestion), as sketched below.
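To make the distinction concrete, here is a rough sketch of the two ingestion styles in Python. The collector URL, query parameters, and the deposits table are purely illustrative, not our actual endpoints or schema.

```python
# Illustrative only: endpoint, parameters, and table names are hypothetical.
import requests

# 1) User interaction: a "tracking pixel" style GET call, fired as the user
#    interacts with the website or app.
requests.get(
    "https://collector.example.com/i",
    params={"e": "pv", "url": "https://www.example.com/pricing", "aid": "web"},
)

# 2) Event: a batch job that periodically picks up new business events
#    (e.g. deposits) from an operational database and marks them for ingestion.
def new_deposits_since(conn, last_run_at):
    """Fetch deposits created since the last batch run (DB-API connection)."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, customer_id, amount, created_at "
            "FROM deposits WHERE created_at > %s",
            (last_run_at,),
        )
        return cur.fetchall()
```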

Tracking of user interactions

Tracking components

We decided to use Snowplow for the user interaction flow in the system. More specifically, we run two components, Collector and Enrich, as you can see in the diagram. The Collector exposes a tracking endpoint that is embedded in our websites and mobile applications and receives all user-interaction events via GET calls. It does not store these events but sends them directly to a Kinesis stream. That stream is pushed into S3 for raw data backup via Firehose, and also sent to Snowplow’s Enrich component. The enricher does some basic pre-processing such as detecting the device/version from the user agent, de-structuring Google’s UTM tracking parameters, and so on. The data is then sent onwards to the next Kinesis stream, where Firehose picks it up again, performs some adjustments to the data via a Lambda function, and writes it directly into Redshift (Redshift is a delivery target that comes with Firehose out of the box: https://aws.amazon.com/kinesis/data-firehose/).
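For illustration, a Firehose transformation Lambda of this kind might look roughly like the sketch below. The field positions and output columns are placeholders, not our exact Snowplow schema.

```python
# A minimal sketch of a Kinesis Firehose transformation Lambda, assuming the
# enriched Snowplow events arrive as tab-separated lines. Field indices and
# output columns are illustrative, not our actual schema.
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose hands each record to the Lambda base64-encoded.
        payload = base64.b64decode(record["data"]).decode("utf-8")
        fields = payload.rstrip("\n").split("\t")

        # Keep only the columns the Redshift table expects (illustrative indices).
        row = {
            "event_id": fields[6],
            "collector_tstamp": fields[3],
            "event_name": fields[5],
            "page_url": fields[12],
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            # Firehose expects the transformed payload re-encoded as base64.
            "data": base64.b64encode(
                (json.dumps(row) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })

    return {"records": output}
```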

Processing of events

For any data that is not directly sent to Snowplow’s tracking endpoint, such as new customer signups, deposits, withdrawals, and so on, we use a different data pipeline to get data into Redshift.

We built a basic ETL job using Airflow that runs once per day, extracts data from the various databases we connect to into S3, loads the data into Redshift, and processes it for the analytics systems we run (so technically it’s an ELT process rather than ETL).

ELT in Airflow, with some coordination elements for data coming in from different country regions
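To give an idea of the shape of that job, here is a minimal Airflow DAG sketch. The task names and the three helper functions are hypothetical stand-ins for our actual extract, load, and transform steps.

```python
# A rough sketch of a daily ELT DAG (Airflow 1.x style); names are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_to_s3():
    """Dump the relevant source tables into S3 (placeholder)."""


def load_into_redshift():
    """COPY the S3 extracts into Redshift's raw schema (placeholder)."""


def transform_schemas():
    """Run the SQL that rebuilds the bi and crm schemas (placeholder)."""


default_args = {
    "owner": "data",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_elt",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load_into_redshift", python_callable=load_into_redshift)
    transform = PythonOperator(task_id="transform_bi_and_crm", python_callable=transform_schemas)

    # Extract first, then load, then transform for the downstream systems.
    extract >> load >> transform
```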

Storage

That brings us to storing the data: Redshift. You can think of Redshift as a (basically) limitless Postgres-compatible database into which you can pump your data and query it using familiar SQL commands.

Raw data flows into Redshift and is transformed with a set of SQL commands into two schemas (i.e. sets of tables), bi and crm, to power the respective downstream systems. You don’t want to operate directly on the raw data, as it is subject to change (for example when the table structure of your core database changes); having this “staging” step makes it possible to provide compatibility options for the downstream systems if needed.
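As an example of what such a staging transformation can look like, here is a sketch that rebuilds one bi table from raw data. The schema, table, and column names are made up for illustration; since Redshift speaks the Postgres wire protocol, a standard Postgres driver such as psycopg2 works.

```python
# Illustrative staging transform: rebuild a bi table from raw data in Redshift.
# Schema, table, and column names are placeholders.
import psycopg2

STAGING_SQL = """
BEGIN;
DROP TABLE IF EXISTS bi.deposits;
CREATE TABLE bi.deposits AS
SELECT
    d.customer_id,
    d.amount,
    d.currency,
    d.created_at::date AS deposit_date
FROM raw.deposits d
WHERE d.status = 'COMPLETED';
COMMIT;
"""

def rebuild_bi_schema(dsn: str) -> None:
    """Run the staging SQL against Redshift via the Postgres protocol."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(STAGING_SQL)

if __name__ == "__main__":
    # Connection string is a placeholder.
    rebuild_bi_schema("postgresql://user:password@redshift-host:5439/analytics")
```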

Analytics

Now the most interesting (and maybe most important) part — you have all this data in your warehouse, but what are you going to do with it? For us, some of the most important use cases are:

  • Create reports for different departments in the company
  • Analyze and fine-tune the performance of online marketing campaigns
  • Power our CRM system to send out communications to our customers

The tools we currently use for the above are a combination of Metabase (https://metabase.com/, open-source analytics), Google Analytics (you’ve probably heard of it…), as well as Emarsys (https://www.emarsys.com/, for CRM).

Sample Metabase dashboard

Metabase allows us to connect to Redshift directly and construct queries using a visual query builder, configure dashboards and reports, send daily digests into Slack, alert us when certain data thresholds have been reached, and much more. Since it’s quite a versatile and user-friendly tool, many of our employees have access to it. It does not contain any customer-identifiable information, of course — everything is anonymized (or rather not even exported from the core database systems).

Google Analytics is a tool that a lot of people use to get up and running with their analytics efforts quickly. We did the same for our “marketing website” (all the pages you browse before you sign up for the actual product/app), as this is where we drive most of our marketing traffic, and the capabilities offered are sufficient for controlling our marketing campaigns. We also feed some data from the data warehouse back into Google Analytics in order to track users across the signup funnel and relate those numbers back to specific marketing campaigns for performance analysis. Over time, however, we are going to rebuild the reports we use in Google Analytics inside Metabase, as we have more control over the reports and data there and can easily enrich and correlate them with all the other data in our data warehouse.
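One way to wire up such a feedback loop (and the sketch below is only illustrative, with placeholder property and client IDs) is Google Analytics’ Measurement Protocol, which lets a server send events for a known client ID.

```python
# A hedged sketch of pushing a server-side funnel event into Google Analytics
# via the Universal Analytics Measurement Protocol. Property ID and client ID
# are placeholders; the event category/action names are made up.
import requests

def send_funnel_event(client_id: str, action: str) -> None:
    requests.post(
        "https://www.google-analytics.com/collect",
        data={
            "v": "1",               # protocol version
            "tid": "UA-XXXXXXX-1",  # GA property ID (placeholder)
            "cid": client_id,       # anonymous client ID captured on the website
            "t": "event",           # hit type
            "ec": "signup_funnel",  # event category (illustrative)
            "ea": action,           # event action, e.g. "account_funded"
        },
        timeout=5,
    )
```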

Lastly, we have a variety of communication campaigns that we control from our CRM system, Emarsys. One of the main criteria for selecting Emarsys over other systems was the ability to connect Redshift as a data source and use any data inside it for segmentation and dynamic content insertion. The alternative is normally to pump all required data periodically into the CRM system and essentially duplicate your data there. We felt that treating the CRM system as just another component that reads from the same data storage as all other systems made much more sense.

Conclusion

As described above, we have put quite a bit of internal development effort behind the data that powers our decision-making processes. The benefit you gain from having what I like to call “data awareness” in your organization outweighs the additional engineering challenges that come with it. In fact, the challenges remain the same and only become harder to solve the longer you put off taking control of a proper data pipeline inside your company.

So: don’t treat any decisions around your data stack lightly. Your CTO should sit at the table when marketing teams make decisions around tracking or CRM systems, and have a point of view on the data strategy moving forward.

We may explore a few more deep dives into some of the mentioned aspects of our data stack in the future. If you want to learn more, please feel free to reach out in the comments and let us know.

We are constantly on the lookout for great tech talent to join our engineering team — visit our website to learn more and feel free to reach out to us!
