Rules Evaluation: Percolation in Elasticsearch

Industry: Mobile Adtech

Online advertising has become an important marketing channel for most companies targeting the consumer space. These companies span industry verticals like FMCG, Automotive, Electronics, etc. Most of their products are targeted at specific types of people, where "type" might be something as simple as gender, or something more complex like a male aged 25–35 living in Bengaluru (Bangalore), India.

Online marketing is typically split into various channels. A channel is a medium used for showing ads to the user; before the advent of the world wide web, the widely known channels were TV, radio, roadside banners, etc. In the online space, at a high level, these include:

  • websites (static content like banners)
  • social media (static content like banners)
  • YouTube (video, much like on TV), etc.

Apart from channels, the hardware used to serve these ads is increasingly becoming an important subcategory within the channels above. The two primary categories here are desktop and mobile. Desktop refers to any hardware that is stationary, for example your desktop computer :) or laptop. Mobile refers to any hardware that works on the mobile network, for example smartphones. The predominant difference between the two is that mobile can provide dynamic location (latitude, longitude) information, so important habits/traits about a user can be derived by looking at the places they have visited or accessed the web from, whereas desktops are more stationary access points.

We provide a platform that lets clients target mobile users for their marketing campaigns. Since clients have very specific requirements about who they want to target, our platform has to provide these attributes associated with deviceIds (users). Typically the building blocks of information, like gender, age group, and high-level classifications of users into groups like moviegoers or sports fans, are produced using various data science models.

We wanted to provide clients with these building blocks so they could define their own target audience for their campaigns. Such a definition would be captured as a Rule, specific to a client, to a campaign run by the client, to a set of campaigns run by the client during summers, etc... We wanted to build a system that could incorporate new building blocks over time and provide operators (and, or, etc.) over these building blocks to construct the Rule. So, for example, a subset of the Rule definition for the target audience we defined earlier would be: Male and Age between 25–35.

As we saw earlier, location is a very interesting attribute for targeting mobile users, so we also wanted to support it in the rule definition. To capture something like "living in Bangalore, India", we can take the latitude, longitude of a point near the center of Bangalore (12.980725, 77.588357) and a radius of 20 KM, so the Rule now becomes: Male and Age between 25–35 and [latitude, longitude] within 20 KM of [12.980725, 77.588357].

Hopefully, the above gives an idea of what we wanted to achieve with this platform. Now let's get into some technical details.

Technical Context: BidRequests from ad exchanges were the first set of raw data we started working with in our system (there were other streams as well, which we would integrate later). To start with, we captured selective/important information from the BidRequests we received. We would then look at identifying additional attributes that could be inferred from the information we had captured, for example home country: defined as the country where the user lives, derived by looking at the various [latitude, longitude] details for the device over a period of time and seeing which country they mostly came from. As you can see, there is a series of attributes we had to maintain per user, with the series defined as a combination of spatial {space: (latitude, longitude)} and temporal {time} information.
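To make this concrete, here is a rough sketch of the kind of per-user attribute set we maintained. The field names (deviceId, country, segments, locations) and the values are illustrative assumptions, not our exact schema; the point is the combination of derived attributes with a spatial + temporal series:

```
{
  "deviceId": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "country": "IN",
  "segments": [11, 23, 42],
  "locations": [
    { "lat": 12.980725, "lon": 77.588357, "ts": "2016-03-01T09:30:00Z" },
    { "lat": 12.927923, "lon": 77.627108, "ts": "2016-03-01T19:10:00Z" }
  ]
}
```

The user document we later test against percolation queries is a flattened projection of this series.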

We capture the above attributes about the user (deviceId) via an ingestion + streaming system into Cassandra. The building blocks we discussed above, like gender, age group, and behavioural traits, were determined as part of the streaming system, where we profiled these devices using various data science models.

A Rule as defined above had to incorporate geo targeting. Geo targeting can be defined as checking whether a given latitude, longitude falls within a given radius of another latitude, longitude, or within a polygon (a closed area with any number of edges/vertices). A lot of customers were interested in the former, where they would provide the specific locations they were interested in as a list of lat, lon points, each with a certain radius.
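In the Elasticsearch query DSL, these two forms map onto the geo_distance and geo_polygon filters. A minimal sketch, assuming the user document stores its points in a geo_point field called locations (an illustrative name): "within 20 KM of a point" is

```
{
  "geo_distance": {
    "distance": "20km",
    "locations": { "lat": 12.980725, "lon": 77.588357 }
  }
}
```

while "falling within a polygon" is

```
{
  "geo_polygon": {
    "locations": {
      "points": [
        { "lat": 12.97, "lon": 77.56 },
        { "lat": 13.02, "lon": 77.61 },
        { "lat": 12.94, "lon": 77.64 }
      ]
    }
  }
}
```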

So we had the building blocks and we had the geo targeting definition; all we needed now was to decide which system to use to execute these rules.

Approach:

There are various open source rules engines available in Java; we briefly read up on them and finally decided to try two options for our use case:

  • In-house system using ANTLR: This would give us great control over the grammar definition and tuning of the system. The problem, though, was that it would take a great deal of effort and testing to build a system that evaluated rules correctly, and that effort would grow with every operation in the rule we wanted to support.
  • Elasticsearch: Use the percolator API of ES to evaluate rules defined by customers against a user document consisting of all the building blocks of a Rule, as discussed earlier. This would give us access to a well defined/tested grammar with most of the operators we required, like and / or / equals / geo filters, etc., and hence ease of development.

We spent a couple of days writing code for the two approaches to see whether anything would turn up that we did not already know. There were no surprises.

Requirements:

Elasticsearch was the choice, and the initial set of rules to be supported was simple:

  • Identify users for a country: We stored the ISO country code of the home country the user belonged to, so this was just going to be a simple string match.
  • Identify users with membership in specific segments: While building the pipelines to classify users, we had decided to keep segment ids unique across classification groups. For example, the classification groups were gender / yob, the segments were male, female, 25–30, 31–35, etc., and each segment had a unique id. Hence this was a simple attribute holding an array of ids.
  • Identify users within a range of a given lat, lon with a frequency: We were already storing the spatial and temporal information in the attributes associated with the user; all we had to do was an aggregation per unique spatial point to get the frequencies. We had looked at the capabilities of the percolation API, and there did not seem to be anything that would support rule definition via aggregation. Hence we decided to only allow certain pre-defined frequencies like 1, 3, 5, etc., and we did the aggregation over the spatial information within our pipelines.

To get the above working in Elasticsearch we had to create an index and define what our user documents would look like; this definition goes in the _mapping of the index. We would then create one percolation query per Rule under the .percolator type of the index. Defining percolation queries was simple. In the interest of being conservative for the string matches (the country match), we used the "not_analyzed" value for the "index" setting on the property defining country, and we mapped the spatial information using the geo_point type in Elasticsearch. We got the basic mapping and query defined, and a user document tested against them, within a day. Now it was time to identify the scaling capabilities of percolation.
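A rough sketch of what this looked like, with illustrative index, type, and field names (users, user, country, segments, locations) and the pre-5.x percolator API that the .percolator type and not_analyzed settings above correspond to. The mapping:

```
PUT /users
{
  "mappings": {
    "user": {
      "properties": {
        "country":   { "type": "string", "index": "not_analyzed" },
        "segments":  { "type": "long" },
        "locations": { "type": "geo_point" }
      }
    }
  }
}
```

A Rule like "home country India, member of segments 11 and 23, seen within 20 KM of central Bangalore" would then be registered as one query document under the .percolator type (the rule id and segment ids here are made up):

```
PUT /users/.percolator/client1-campaign1
{
  "query": {
    "filtered": {
      "query":  { "term": { "country": "IN" } },
      "filter": {
        "and": [
          { "terms": { "segments": [11, 23] } },
          { "geo_distance": {
              "distance": "20km",
              "locations": { "lat": 12.980725, "lon": 77.588357 } } }
        ]
      }
    }
  }
}
```

Evaluating a user is then a single percolate call with the user document passed inline; the response lists the ids of every registered Rule the document matches:

```
GET /users/user/_percolate
{
  "doc": {
    "country":   "IN",
    "segments":  [11, 23, 42],
    "locations": [
      { "lat": 12.981, "lon": 77.590 },
      { "lat": 12.927, "lon": 77.627 }
    ]
  }
}
```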

Scaling:

Questions:

  • We would keep seeing the same users again and again, and hence wanted to keep re-running the rules against a user once enough new information had been collected since the last run (or when the user was new). So what percolation rate would ES support?
  • The business wanted to allow clients a maximum of 1500 geo targeting locations (each with a radius) per rule; see the sketch after this list for what such a rule looks like.
  • Build a system that would support 10 clients, each having about 10 rules defined, so 100 rules in total.
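One thing worth calling out from the second point: in our setup a rule with many target locations essentially becomes one big "or" over geo_distance filters, one per location, so the 1500-location limit translates directly into very large percolation queries. A trimmed-down, illustrative sketch (two locations instead of 1500, made-up coordinates and radius):

```
{
  "query": {
    "filtered": {
      "query":  { "term": { "country": "IN" } },
      "filter": {
        "or": [
          { "geo_distance": { "distance": "5km",
                              "locations": { "lat": 12.9716, "lon": 77.5946 } } },
          { "geo_distance": { "distance": "5km",
                              "locations": { "lat": 13.0827, "lon": 80.2707 } } }
        ]
      }
    }
  }
}
```

How well percolation held up with queries of that size was exactly what the tests below were meant to answer.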

Observations on data that might cause problems:

  • We saw that users had anywhere from 2 to 200 distinct location points.

We created the tests using JUnit. It was simple, everyone knew it, no additional setup was required, and we had full control over all the variables we wanted to play with, like:

  • How many client threads were hitting the server continuously
  • The percentage of documents we wanted to match a query
  • The list of valid locations [latitude, longitude] to use
  • The number of documents to run through per thread while testing

Most of the configuration above was passed to the tests via -Doption=value arguments.

We ran our tests with one client machine and a two-node Elasticsearch cluster. All three machines had about 64 GB RAM and 40 cores.

Results:

Below is a small subset of the numbers we recorded. We kept tweaking the percolation queries / user document format / number of indices, etc., as we tested and got results back. Some of the tweaks that gave us better throughput are discussed below under Findings.

Findings:

  • Performance seemed to be better with all percolation queries in a single index.
  • To incorporate the pre-defined (system-level) frequencies, we had initially added an extra "frequency" field inside the location attribute of the user document. This didn't work well; we eventually changed to keeping an array of lat / lon points per frequency.
  • We were using a filtered query, with everything except the country code match expressed as filters. Since the country code could usually reject a lot of documents up front, sparing us the costly geo filter matches, we decided to use the "query_first" strategy in the percolation queries. Both tweaks are sketched after this list.
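Putting the last two findings together: the user document ended up carrying one array of points per supported frequency, and the percolation queries kept the country match in the query part of a filtered query with the query_first strategy, so the costly geo filters only ran on documents that had already passed the country check. A rough sketch, with illustrative field names (locations_freq_1, locations_freq_3) for the per-frequency arrays, each mapped as geo_point just like locations above:

```
{
  "country": "IN",
  "segments": [11, 23, 42],
  "locations_freq_1": [
    { "lat": 12.981, "lon": 77.590 },
    { "lat": 12.927, "lon": 77.627 }
  ],
  "locations_freq_3": [
    { "lat": 12.981, "lon": 77.590 }
  ]
}
```

and the corresponding percolation query for "seen at least 3 times within 20 KM of central Bangalore":

```
PUT /users/.percolator/client1-campaign1
{
  "query": {
    "filtered": {
      "query":  { "term": { "country": "IN" } },
      "filter": {
        "and": [
          { "terms": { "segments": [11, 23] } },
          { "geo_distance": {
              "distance": "20km",
              "locations_freq_3": { "lat": 12.980725, "lon": 77.588357 } } }
        ]
      },
      "strategy": "query_first"
    }
  }
}
```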

With the above results and an idea of our real-time user profiling QPS, we moved ahead, placed the order for our ES cluster machines, and deployed.

Things worked great for a few weeks, until the requirements got turned on their head. More about that in the next post.