Using AWS OpenSearch to query supersets

MiQ Tech and Analytics · Nov 28, 2022

Chandan Prakash, Senior software engineer, MiQ

Background

Hub Discover is MiQ’s proprietary audience insights platform. It is an all-in-one visualization tool that helps marketers gather brand & campaign insights to inform their media strategy across channels.

Users can search across brands, third-party audience segments, owned audiences, site domains, locations, keywords, and devices to find an audience. Hub then visualizes data to provide detailed insights about each of these audiences across categories including demographics, device ownership and OTT/TV interests. This vast array of audience and insight data includes over 500,000 brands, 18,000 third-party segments, 80,000 owned audience segments, 55,000 domains, 78,000 geo entities and one million keywords spread across eight countries.

Querying this amount of data was never going to be easy

With these quantities, we knew the data would only be as good as the search capability we built. But with this super set of data, even basic querying presented a challenge. Our primary concerns were that processing would take too long and the results wouldn’t be refined enough.

Despite this, we still wanted to deliver an optimal user experience. So, we set ourselves several ambitious goals:

  1. Showing the most relevant entities for the search made, in their respective categories.
  2. Applying various filters on the search entities to filter them based on metrics.
  3. Showing the above result set within the SLA of one to two seconds.

And then we started the hard work.

Building the solution

To solve our problem, we decided to use the AWS OpenSearch platform. AWS's managed Elasticsearch-compatible offering was the best option available in the market for our use case, and since we were already set up on AWS, it was faster and easier to build on it than to adopt a new tool.

Here’s how we implemented it:

1. Entity-level information such as entity name, entity ID, category, geo, user counts, and other metadata is pushed into S3 as Databricks script output. This is done for each searchable entity type required, including brand segments, keywords, geos, etc.
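For illustration, a single entity record in the S3 output might look something like this (the exact field names and shape are an assumption; they mirror the indexed doc shown in the next step):

{
  "Name": "miq",
  "Entity_type": "brand_segment",
  "Brand_id": 11111,
  "Segment_info": [
    { "user_count": 2, "geo": "SG" },
    { "user_count": 4, "geo": "US" },
    { "user_count": 7, "geo": "DE" }
  ]
}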

2. An index is created on the OpenSearch cluster and the above outputs are uploaded to it as JSON objects. Logstash, reading from S3, is used for the bulk upload of these entities; a sketch of the index creation follows the example doc below.

The JSON objects uploaded are stored as docs in the respective index, as shown here:

@timestamp              Aug 20, 2022 @ 13:58:10.779
@version                1
_id                     brand_seg_11111miq
_index                  aiqx-discover-index
_score                  1
Brand_id                11111
Entity_type             brand_segment
Hyperlocal_brand_id     -
Hyperlocal_brand_name   -
Name                    miq
Segment_info            [
                          { "user_count": 2, "geo": "SG" },
                          { "user_count": 4, "geo": "US" },
                          { "user_count": 7, "geo": "DE" }
                        ]

This is how the doc for the brand segment "miq" is stored in the index.
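Before the Logstash load, the index itself has to exist. Here is a minimal sketch of how such an index could be created; the analyzer and mapping below are illustrative assumptions (the ngram sub-fields are inferred from the search fields used in the next step), not our production configuration.

API call example: PUT https://xxxxxxxxxxxx/discover-index

Request body:

{
  "settings": {
    "analysis": {
      "filter": {
        "partial_match_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "partial_match_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "partial_match_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "ngram": { "type": "text", "analyzer": "partial_match_analyzer" }
        }
      },
      "entity_type": { "type": "keyword" },
      "status": { "type": "keyword" },
      "segment_info": {
        "properties": {
          "user_count": { "type": "integer" },
          "geo": { "type": "keyword" }
        }
      }
    }
  }
}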

3. Once we have all the entities/docs loaded into the index, we use the _search POST API to fetch the hits. We experimented with query and path parameters to apply various filters and clauses. Filters can be applied on name, entity_type, user_counts, geo, etc.

API call example: https://xxxxxxxxxxxx/discover-index/_search

Request body:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": { "entity_type": "brand_segment" }
        },
        {
          "multi_match": {
            "query": "miq",
            "fields": [
              "name^4",
              "name.ngram^2",
              "advertiser_name.ngram",
              "aliases"
            ],
            "lenient": true,
            "operator": "and",
            "type": "most_fields"
          }
        },
        {
          "match": { "status": "active" }
        }
      ]
    }
  },
  "size": 10,
  "from": 0,
  "sort": [ "_score" ]
}
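The ^4 and ^2 suffixes boost exact name matches above partial (ngram) matches, so the closest names rank first. An abridged response for a call like this looks roughly as follows (the values shown are illustrative):

{
  "took": 12,
  "timed_out": false,
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 9.3,
    "hits": [
      {
        "_index": "aiqx-discover-index",
        "_id": "brand_seg_11111miq",
        "_score": 9.3,
        "_source": {
          "Name": "miq",
          "Entity_type": "brand_segment",
          "Brand_id": 11111
        }
      }
    ]
  }
}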

4. We also use POST API calls to update user info after the index is created. This helps us keep the data current until the next full refresh.
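For example, a corrected user count for a single doc could be pushed with the update API (the doc ID and values here are hypothetical):

API call example: POST https://xxxxxxxxxxxx/discover-index/_update/brand_seg_11111miq

Request body:

{
  "doc": {
    "Segment_info": [
      { "user_count": 5, "geo": "SG" },
      { "user_count": 4, "geo": "US" },
      { "user_count": 7, "geo": "DE" }
    ]
  }
}

A partial update like this only touches the fields listed under "doc", leaving the rest of the document unchanged.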

5. We need to recreate the index every time the data refreshes. To maintain zero downtime for users, we first create a temporary index, upload the latest data to it, and then swap it in for the old index. This keeps uptime above 99.99%.
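One common way to make that swap atomic is to have the application query through an alias and re-point the alias once the temporary index is fully loaded. A sketch, assuming the index and alias names below (they are illustrative, not our actual names):

API call example: POST https://xxxxxxxxxxxx/_aliases

Request body:

{
  "actions": [
    { "remove": { "index": "discover-index-old", "alias": "discover-index" } },
    { "add": { "index": "discover-index-new", "alias": "discover-index" } }
  ]
}

Both actions are applied as a single atomic operation, so searches never hit an empty or half-loaded index during the swap.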

Results

So, did we reach our goals?

1. Showing the most relevant entities for the search made, in their respective categories.

We currently have over 5.5 million entities in our index, spanning 15 different entity types. By presenting the top 18 result sets across the entity types, we ensure users are shown only the most relevant data.

2. Applying various filters on the search entities to filter them based on metrics

Users are also able to add additional filters, enabling them to further refine their searches.

3. Showing the above result set within the SLA of 1–2 seconds

And the cherry on top — in spite of this vast data store and filtering, we were able to achieve our SLA of 1–2 seconds.

But our work isn’t finished. As we move towards cookieless solutions, we’re scaling Hub Discover to include more entity types, entities, and filtering capabilities.

Chandan is a senior software developer for MiQ, working from the Bangalore office. Super adventurous, he loves challenging himself through trekking, sports, and bike rides — he’s recently ridden from Kashmir to Manali!
