Real-time and Batch Data Pipeline At POPxo

Raman Damodar Shahdadpuri
Published in POPxo Engineering
3 min read · Apr 2, 2018

POPxo has always been a data-driven company. We have kept a close eye on our data to see what’s working for us and what’s not. Since the inception of POPxo in 2013, the Data Engineering and Analytics teams have worked together to answer complex questions about user behaviour.

Throughout these years, our main source of tracking data has been Google Analytics. Two years ago, we also started using Clevertap, mostly to improve customer engagement. As our user base grew, we realised that we wanted to do more with our data than these tools allowed. In particular, we wanted to track our users across multiple domains, we wanted all our event data to be available in real time and, most importantly, we wanted COMPLETE OWNERSHIP of our data. That’s where Snowplow came into the picture.

Snowplow at POPxo

Snowplow is an open source event analytics platform for mobile and web.

Snowplow provides two pipelines for event tracking: real-time and batch. At POPxo, we have implemented both. The real-time pipeline uses Elasticsearch as its data sink. It helps us see what our users are most interested in so that we can show them more relevant content. It also helps us with fraud detection.
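
As an illustration of the kind of question the real-time sink can answer, here is a minimal sketch in Python that builds an Elasticsearch search body for the most viewed pages in the last few minutes. The field names (`event`, `collector_tstamp`, `page_urlpath`) are assumed to match Snowplow's enriched event fields as indexed, and the `top_pages_query` helper is hypothetical; it is not a description of our production queries.

```python
import json


def top_pages_query(minutes=15, size=10):
    """Build a search body for the most viewed page paths in a recent window."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event": "page_view"}},
                    # Elasticsearch date math: "now-30m" means 30 minutes ago.
                    {"range": {"collector_tstamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
        "aggs": {
            # Bucket events by page path and keep the largest buckets.
            "top_pages": {"terms": {"field": "page_urlpath", "size": size}}
        },
    }


body = top_pages_query(minutes=30, size=5)
body_json = json.dumps(body, indent=2)  # ready to send as a search request body
```

The same body can be pasted into Kibana's console or sent with any Elasticsearch client.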

The batch pipeline runs on AWS’ EMR service every few hours, with Redshift as its data sink. Our data analysts have been given access to Redshift so that they can keep track of the data using SQL queries.
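
To give a flavour of the SQL an analyst might run, here is a small sketch that counts daily page views and unique users. It uses an in-memory SQLite database as a stand-in for Redshift so that it runs anywhere; the table name `events` and the sample rows are made up for illustration, while the column names are assumed to follow Snowplow's event table. On Redshift the date extraction would be written differently (for example with `DATE_TRUNC`).

```python
import sqlite3

# SQLite stands in for Redshift here (an assumption made for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event TEXT, domain_userid TEXT, collector_tstamp TEXT)"
)
# Made-up sample rows: three page views and one other event over two days.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("page_view", "u1", "2018-04-01 10:00:00"),
        ("page_view", "u2", "2018-04-01 11:30:00"),
        ("page_view", "u1", "2018-04-02 09:15:00"),
        ("struct", "u1", "2018-04-02 09:16:00"),
    ],
)

DAILY_PAGE_VIEWS = """
    SELECT substr(collector_tstamp, 1, 10) AS day,
           COUNT(*)                        AS page_views,
           COUNT(DISTINCT domain_userid)   AS users
    FROM events
    WHERE event = 'page_view'
    GROUP BY day
    ORDER BY day
"""

# One row per day: (day, number of page views, number of distinct users).
daily = conn.execute(DAILY_PAGE_VIEWS).fetchall()
```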

Snowplow takes advantage of various AWS services such as EMR, Kinesis, S3, Redshift and, in our case, Elasticsearch as well. We prefer AWS’ Elasticsearch Service because it can be scaled up easily as and when necessary.

Components Involved in Snowplow

Tracker

Trackers capture user event data and send it to the collector. They are client-side or server-side libraries. We currently use the JavaScript, iOS, and Android trackers.
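
To make the tracker's job concrete, here is a minimal sketch in Python of how a page view could be packed into a GET request for the collector. The parameter names (`e`, `url`, `page`, `aid`, `p`, `eid`, `dtm`) are assumed to follow the Snowplow tracker protocol; the collector hostname, the app ID and the `build_pageview_url` helper are hypothetical. The real trackers do a lot more, such as batching and retries.

```python
import time
import uuid
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical collector endpoint, not a real hostname.
COLLECTOR_ENDPOINT = "https://collector.example.com/i"


def build_pageview_url(app_id, page_url, page_title, platform="web"):
    """Return a GET URL that carries a single page view event."""
    params = {
        "e": "pv",                            # event type: page view
        "url": page_url,                      # URL of the page being viewed
        "page": page_title,                   # page title
        "aid": app_id,                        # application ID
        "p": platform,                        # platform, e.g. "web" or "mob"
        "eid": str(uuid.uuid4()),             # unique event ID
        "dtm": str(int(time.time() * 1000)),  # device timestamp in milliseconds
    }
    return COLLECTOR_ENDPOINT + "?" + urlencode(params)


tracking_url = build_pageview_url("example-web", "https://www.example.com/", "Home")
# Parse the query string back to see what the collector would receive.
sent = parse_qs(urlparse(tracking_url).query)
```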

Collector

The collector is an HTTP web service that receives data from the trackers and forwards it to the raw data Kinesis stream for further processing. We use the Scala Stream Collector because it lets us track our users across multiple domains.
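
The idea behind cross-domain tracking can be sketched as follows: the collector sets its own cookie, so the same browser carries the same ID no matter which site sent the event. This is a simplified Python illustration and not the Scala Stream Collector's actual code. The `collect` function, the cookie name `sp` and the `nuid` field are assumptions, and an in-memory queue stands in for the raw Kinesis stream.

```python
import json
import uuid
from collections import deque

# In-memory stand-in for the raw Kinesis stream (assumption for this sketch).
raw_stream = deque()

COOKIE_NAME = "sp"  # assumed name of the collector's cookie


def collect(query_params, cookies):
    """Handle one tracker request and return the cookies to set on the response."""
    # Reuse the visitor's ID if the collector cookie is present, else create one.
    network_user_id = cookies.get(COOKIE_NAME) or str(uuid.uuid4())
    record = dict(query_params, nuid=network_user_id)
    raw_stream.append(json.dumps(record))  # forward the event for processing
    return {COOKIE_NAME: network_user_id}


# The same browser visits two different sites that share one collector.
first = collect({"e": "pv", "url": "https://site-a.example.com/"}, cookies={})
second = collect({"e": "pv", "url": "https://site-b.example.com/"}, cookies=first)
```

Both events end up in the stream with the same user ID even though they came from different domains.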

Enricher

Enrichers add extra dimensions to the raw event data. One example of data enrichment is IP lookup: this enricher fetches extra information about the user’s IP address and sends the information to the enriched Kinesis stream.
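
As a rough illustration of what an IP lookup enrichment does, the Python sketch below adds location fields to an event based on its IP address. The lookup table is hypothetical and uses reserved documentation IP ranges with made-up locations; a real enricher would consult a full geolocation database. The field names `user_ipaddress`, `geo_country` and `geo_city` are assumed to match Snowplow's event fields.

```python
import ipaddress

# Hypothetical lookup table. The ranges are reserved documentation networks,
# so the locations attached to them are purely illustrative.
GEO_TABLE = [
    (ipaddress.ip_network("203.0.113.0/24"), {"geo_country": "IN", "geo_city": "Delhi"}),
    (ipaddress.ip_network("198.51.100.0/24"), {"geo_country": "US", "geo_city": "New York"}),
]


def enrich_with_ip_lookup(event):
    """Return a copy of the event with location fields added when the IP is known."""
    enriched = dict(event)
    try:
        address = ipaddress.ip_address(event.get("user_ipaddress", ""))
    except ValueError:
        return enriched  # missing or malformed IP: pass the event through unchanged
    for network, location in GEO_TABLE:
        if address in network:
            enriched.update(location)
            break
    return enriched


raw_event = {"event": "page_view", "user_ipaddress": "203.0.113.42"}
enriched_event = enrich_with_ip_lookup(raw_event)
```

The raw event is left untouched, so a failed lookup never loses data; the event simply flows on without the extra fields.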

Storage

This component involves storing data. By default, the batch pipeline saves data in S3. From there, the data is copied into Redshift. Snowplow provides an option to save data in PostgreSQL as well.

EmrEtlRunner Service

The EmrEtlRunner is a service used in the batch pipeline. It fetches the raw logs from an S3 bucket, enriches them and stores the data in S3 and eventually in Redshift. All the steps involved in the EmrEtlRunner can be viewed here.

Data Visualization Tools

Snowplow does not provide an out-of-the-box data visualisation interface like Google Analytics and Clevertap do. To overcome this, we started using third-party tools, namely Kibana, Redash, and Superset.

Kibana is used for the data that is stored in Elasticsearch. It helps us visualise our users’ activity in real time. It also has a dashboard feature that lets us create our own dashboards and track many metrics on a single page.

Redash and Superset are the tools we use with the data stored in Redshift. Data can be fetched using simple SQL queries. Superset goes one step further with a great interface for exploring the data without writing SQL. Both provide great interfaces for data visualisation, and both can be used with other data sources such as PostgreSQL, Elasticsearch, BigQuery, MongoDB and MySQL.

What’s Next?

This was an introduction to the event tracking pipeline at POPxo. So far, we have tracked around 300 million events, averaging 2 million events every day. But the more important question is: what do we do with this data? We have built a Bot Detection System and a Recommendation Engine using the data tracked with Snowplow.

Want to be a part of POPxo? Check out our job openings page and let us know if you are interested.
