Bot Detection System at POPxo using Snowplow and Maxmind

Archit Ahluwalia
Published in POPxo Engineering
3 min read · Apr 4, 2018

This post continues our earlier post on how we use Snowplow to track events.

Background

Apart from having an online community for women, POPxo has also built Plixxo, India’s largest Influencer Marketplace. Influencers are paid for sharing and posting content on social media and other platforms.

At Plixxo, we run many campaigns. One of them is the Share and Earn Campaign, in which influencers are paid for sharing POPxo’s links. Each influencer is given a unique track code, which is added to any story link the influencer would like to share. They then share this new link with other people, and we pay them based on the number of views these links receive.
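
As a rough illustration of how a track code might be attached to a story link, the sketch below appends it as a query parameter. The parameter name `track_code`, the function name and the example URL are assumptions made for illustration; the post does not describe how Plixxo actually encodes the code in the link.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def add_track_code(story_url, track_code, param="track_code"):
    """Return story_url with the influencer's track code as a query parameter.

    `param` is a hypothetical parameter name, not Plixxo's real one.
    """
    parts = urlparse(story_url)
    # Keep existing query parameters, dropping any earlier track code.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, track_code))
    return urlunparse(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(add_track_code("https://example.com/story?a=1", "abc123"))
```

Replacing an existing code rather than appending a second one means a re-shared link credits only one influencer. That is a design choice of this sketch, not something stated in the post.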

The final goal of the Share and Earn Campaign was to drive more traffic to POPxo using Plixxo’s reach.

The Problem

A few days after launching this campaign, we noticed that a few users were getting a very high number of page views. On investigating, we found that the visitors’ IP addresses belonged to hosting providers like Amazon and Digital Ocean. A few fraudulent influencers were using bots to generate fake page views. Manually going through every user’s data to identify such IP addresses was next to impossible.

The Solution

To fix this problem in a scalable way, we had to do two things. First, we needed the ISP information for each visitor’s IP address. We got this from Snowplow’s IP Lookups Enrichment, which internally uses Maxmind’s database. Second, we had to create our own list of ISPs that we wanted to blacklist. This list was maintained in Plixxo’s database.
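
The blacklist check itself can be sketched in a few lines. The version below is a minimal sketch: the blacklist entries, the normalisation rule and the function names are all assumptions, since the real list lived in Plixxo’s database and the post does not say how names were compared.

```python
def normalize(name):
    """Lower-case and collapse whitespace so spelling variants compare equal."""
    return " ".join(name.lower().split())


# Hypothetical entries; the real blacklist was stored in Plixxo's database.
BLACKLISTED_ISPS = {normalize(n) for n in ["Amazon.com", "Digital Ocean"]}


def is_blacklisted(isp, organization=None):
    """Return True if the ISP or organisation name is on the blacklist.

    Both values are expected to come from the IP lookup; either may be None
    when the lookup found nothing for the address.
    """
    for name in (isp, organization):
        if name and normalize(name) in BLACKLISTED_ISPS:
            return True
    return False
```

Exact matching after normalisation keeps false positives low, but it only catches names spelled the same way as a blacklist entry, so the list would need to hold every variant the lookup database returns.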

Steps Involved

  1. The data in the raw Kinesis stream is ‘enriched’ using the IP Lookups Enrichment, i.e., extra information such as ISP name, organisation, country, city, latitude and longitude is fetched from Maxmind’s GeoIP database. This data is eventually sent to Elasticsearch and Redshift.
  2. The data is fetched from Elasticsearch, and for each page_view event we check the ISP. If it matches any ISP in our blacklist, we do not count the event as a valid page_view and the user is not paid for it.
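
The second step can be sketched as a small filter over events that have already been fetched from Elasticsearch. This is an illustration under stated assumptions, not our production code: the field names `event`, `page_url` and `ip_isp` follow Snowplow’s canonical event model, while the `track_code` URL parameter, the blacklist entries and the sample events are hypothetical. Events with no ISP information are treated as valid here, which is also an assumption.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Hypothetical blacklist, stored lower-cased for comparison.
BLACKLIST = {"amazon.com", "digitalocean"}


def track_code_of(page_url, param="track_code"):
    """Extract the (hypothetical) track code parameter from a page URL."""
    values = parse_qs(urlparse(page_url).query).get(param)
    return values[0] if values else None


def count_valid_views(events, blacklist=BLACKLIST):
    """Count page views per track code, skipping blacklisted ISPs."""
    counts = Counter()
    for ev in events:
        if ev.get("event") != "page_view":
            continue  # ignore page pings and other event types
        code = track_code_of(ev.get("page_url") or "")
        if code is None:
            continue  # not a Share and Earn link
        isp = (ev.get("ip_isp") or "").strip().lower()
        if isp in blacklist:
            continue  # bot-like traffic from a hosting provider
        counts[code] += 1
    return counts


# Made-up events for illustration only.
SAMPLE = [
    {"event": "page_view", "page_url": "https://example.com/s?track_code=a1", "ip_isp": "Airtel"},
    {"event": "page_view", "page_url": "https://example.com/s?track_code=a1", "ip_isp": "Amazon.com"},
    {"event": "page_view", "page_url": "https://example.com/s?track_code=b2", "ip_isp": "DigitalOcean"},
    {"event": "page_ping", "page_url": "https://example.com/s?track_code=a1", "ip_isp": "Airtel"},
    {"event": "page_view", "page_url": "https://example.com/s", "ip_isp": "Airtel"},
]

if __name__ == "__main__":
    print(dict(count_valid_views(SAMPLE)))
```

Of the five sample events, only the first counts towards a payout: two come from blacklisted ISPs, one is not a page_view, and one carries no track code.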

Outcome

After integrating the above system in Plixxo, we were able to observe two major changes:

  1. Page views per day generated from Share and Earn dropped to about 60% of the initial numbers. While this might seem like a loss, the page views we were now counting came from genuine users. This improved the overall quality of traffic at POPxo.
  2. With bot-generated page views excluded, we could be confident that we were paying for views from genuine users and not bots. As a result, the cost of the campaign dropped drastically.

This is just one application of things we do with data. We have also implemented a Recommendation Engine on top of the data that is tracked using Snowplow.

Want to be a part of POPxo? Check out our job openings page and let us know if you are interested.
