Using Elasticsearch for Reporting & Analytics

Reporting at Tyroo Technologies

While building the new reporting panel for our customers at Tyroo Technologies, we evaluated Elasticsearch for one component of the product. This article illustrates our thought process in choosing Elasticsearch to solve the analytics puzzle.

As an AdTech company, it is very important for our clients to know which of their campaigns are performing, what they can do to improve them, and how they can optimize them for different devices and locations, based on device- and geolocation-level analytics. This is only possible if we have detailed reporting for each and every campaign we run. Earlier, our customer interface used MySQL for the reporting panel. That was alright up to a point, but the challenges of large data started popping up with this implementation.

Our Current MySQL Reporting Architecture

  1. For our legacy reporting needs, we have always kept aggregated numbers for each campaign in MySQL, day-wise or hour-wise as per the requirements serviced by the report.

  2. We have maintained separate tables for impressions, clicks, and conversions, which keep track of each individual event along with its corresponding data like country, OS, version, browser, IP address, etc.

Problems with Our MySQL Reporting

  1. One of our requirements was to show aggregate reports. The aggregated tables can serve this easily, but only date-wise. If a client wants to slice by any other dimension, we have to consider the event tables, and aggregation/GROUP BY on those tables comes with its own cost.
  2. A client can group (aggregate) data only on views that are exposed to them, because we can't keep indexes on all fields; MySQL indexing comes with its own cost. This limits clients from drilling down into different aspects of the data.
  3. Finally, a nested GROUP BY on a large dataset consumes too much time. For example, if a client wants to group by country, OS, and browser, it takes ages to return the result, which ultimately means a bad client experience.
(Image: MySQL nested aggregation result on large data)

These were some of the problems that compelled us to think about another tool for reporting. We had heard quite a lot about Elasticsearch and thought we should give it a try. This article lists the methodology we followed to get our new reporting live with Elasticsearch in the background.

Before following the steps below, make sure that you have optimized your Elasticsearch cluster for maximum performance gain. You can get more info from the production deployment checklist in the Elasticsearch documentation. Without that, you will not be able to gain maximum performance, or you might run into trouble.

Our Implementation of Elasticsearch for Reporting and Analytics

For those who don't know about Elasticsearch, according to Wikipedia:

Elasticsearch is a search engine based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

You can read more about Elasticsearch in its official documentation. In the next few steps, I will guide you through how we have configured Elasticsearch as our reporting and analytics ninja ;)

  1. Define your mapping template before indexing your data. Following is our sample mapping:
{
  "template": "sample-*",
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0,
    "index.cache.query.enable": true,
    "index.requests.cache.enable": true
  },
  "mappings": {
    "sample": {
      "properties": {
        "app_id": { "type": "string", "index": "not_analyzed", "doc_values": true },
        "service_id": { "type": "string", "index": "not_analyzed", "doc_values": true },
        "city": { "type": "string", "index": "not_analyzed", "doc_values": true },
        "country": { "type": "string", "index": "not_analyzed", "doc_values": true },
        "timestamp": { "type": "date", "format": "epoch_millis", "doc_values": true },
        "date_time": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss", "doc_values": true },
        "device_id": { "type": "string", "index": "not_analyzed", "doc_values": true },
        "device_model": { "type": "string", "index": "not_analyzed", "doc_values": true },
        "device_os": { "type": "string", "index": "not_analyzed", "doc_values": true },
        "lat": { "type": "double" },
        "lng": { "type": "double" },
        "location": { "type": "geo_point" }
      }
    }
  }
}

As you can see, we have marked each field as not_analyzed, since we don't need full-text search capability on any field; we will always search or filter on exact strings. Another thing to note is that we are using doc_values to reduce heap usage. We have also enabled the query cache to take advantage of Elasticsearch's query caching.
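
For reference, a template like this is registered once through the index-template API and then applied to every matching index automatically. A minimal sketch (the template name sample_template is illustrative, and the body is trimmed to one field for brevity; in practice you would send the full mapping shown above):

# "sample_template" is an illustrative name; any template name works
PUT http://localhost:9200/_template/sample_template
{
  "template": "sample-*",
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  },
  "mappings": {
    "sample": {
      "properties": {
        "country": { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}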

2. Decide whether you need time-based indexing or not. We are using monthly indexing, i.e. we create a new index for each month. You can use daily, weekly, or yearly indexing depending on your needs.

One of the main advantages of using time-based indexing is that you can change your configuration at any point in time without re-indexing all of your data. For example, you can increase the number of shards if you expect a lot more data from that point onwards, or you can easily back up your old indexes without compromising the performance of your current index.
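
For example, once a month has passed, its index can be closed (or snapshotted) without touching the index currently receiving writes; a sketch, assuming a hypothetical monthly index named sample-2016-07:

# closing last month's index frees its resources; the index name is hypothetical
POST http://localhost:9200/sample-2016-07/_close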

You might be wondering why I have written the template name (sample-*) as a wildcard pattern. The reason is that every index whose name starts with the "sample-" prefix can use this template, e.g. if you are using day-based indexing like sample-2016-08-27, sample-2016-08-28, and so on. Cool, yeah!

So the basic advantage of creating a template is that you don't have to worry about creating a new mapping with each index. Every time you create a new index, it will automatically pick up your mapping from the template, provided your index name matches the template pattern.

Let's say you get a new requirement and have to add a new field to your index: delete your current mapping template, create a new mapping template with the new field, and you are ready to go. The next index created will pick up your new mapping (existing indexes keep their old mapping, since templates are applied only at index creation). A sketch of this flow is shown below.
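
A minimal sketch, reusing the illustrative sample_template name from above and adding the state field that the nested-aggregation example later in this post relies on (mapping trimmed for brevity; re-register the full mapping in practice):

# drop the old template, then register it again with the new field
DELETE http://localhost:9200/_template/sample_template

PUT http://localhost:9200/_template/sample_template
{
  "template": "sample-*",
  "mappings": {
    "sample": {
      "properties": {
        "country": { "type": "string", "index": "not_analyzed", "doc_values": true },
        "state": { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}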

3. Now what? Nothing. Just index your data and you are good to go with your reporting and analytics.
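
For example, indexing one document into the current monthly index could look like this (all values are made up; the index is created on first write and picks up its mapping from the template):

# every field value below is illustrative
POST http://localhost:9200/sample-2016-08/sample
{
  "app_id": "demo-app",
  "service_id": "svc-1",
  "city": "Gurgaon",
  "country": "IN",
  "timestamp": 1472292000000,
  "date_time": "2016-08-27 14:30:00",
  "device_id": "d-123",
  "device_model": "Nexus 5",
  "device_os": "Android",
  "lat": 28.47,
  "lng": 77.03,
  "location": "28.47,77.03"
}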

Now that I have brought you this far, can I also guide you through writing basic queries to fetch data from Elasticsearch for reporting and analytics? Yes: in the next few steps, I will cover basic aggregation queries and date histogram usage.

Basic Query for Reporting and Analytics

Aggregation (grouping) is an integral part of any reporting or analytics. Whether you want the number of visitors from a particular country, OS, or browser, you need to group your data accordingly. Let's see an example of aggregation.

  1. Aggregation on country
POST http://localhost:9200/sample-*/_search
{
  "from": 0,
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "date_time": {}
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "country": {
      "terms": {
        "field": "country",
        "size": 5,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {}
    }
  },
  "fields": [
    "*",
    "_source"
  ]
}

Some points regarding the above request:

  1. sample-* will search across all indexes whose names start with the "sample-" prefix. If you want to search only one index, provide the complete name of that index, like sample-2016-08-27.
  2. If you are only interested in the aggregation numbers, always use "size": 0 in the request.
"size": 0 avoids the fetch phase of the query, which ultimately improves your aggregation query performance.

Result (as returned through our API wrapper; the hits and aggregation sections come from Elasticsearch):

{
  "status": "success",
  "data": {
    "hits": {
      "total": 5005,
      "max_score": 0,
      "hits": []
    },
    "aggregation": {
      "country": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "IN",
            "doc_count": 3163
          },
          {
            "key": "US",
            "doc_count": 1842
          }
        ]
      }
    }
  },
  "statusCode": 200
}

Let's say you want an aggregation on country and state (a nested aggregation).

POST http://localhost:9200/sample-*/_search
{
  "from": 0,
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "*"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "date_time": {}
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "country": {
      "terms": {
        "field": "country",
        "size": 5,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "state": {
          "terms": {
            "field": "state",
            "size": 5,
            "order": {
              "_count": "desc"
            }
          },
          "aggs": {}
        }
      }
    }
  },
  "fields": [
    "*",
    "_source"
  ]
}

The result is not shown here, as it would make this post unnecessarily lengthy.

2. If you want the number of users per day, you have to use a date histogram.

POST http://localhost:9200/sample-*/_search
{
  "from": 0,
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "*"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "date_time": {
                  "gte": 141098456,
                  "lte": 1459023455,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "date": {
      "date_histogram": {
        "field": "date_time",
        "interval": "1d",
        "time_zone": "Asia/Kolkata",
        "min_doc_count": 1,
        "extended_bounds": {
          "min": 141098456,
          "max": 1459023455
        }
      },
      "aggs": {}
    }
  },
  "fields": [
    "*",
    "_source"
  ]
}

Some points regarding the above request:

  1. You can set the time zone according to your convenience. The best approach is to store data in UTC and specify the time zone when you fetch it.
  2. You can also specify the interval for grouping. In the above request I have grouped day-wise ("1d"); you can group on a minutely, hourly, weekly, monthly, or yearly basis as well, as shown in the sketch below.
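
For instance, switching the same histogram to hourly buckets only means changing the interval; a trimmed sketch without the query section:

POST http://localhost:9200/sample-*/_search
{
  "size": 0,
  "aggs": {
    "date": {
      "date_histogram": {
        "field": "date_time",
        "interval": "1h",
        "time_zone": "Asia/Kolkata",
        "min_doc_count": 1
      }
    }
  }
}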

Aggregations and date histograms are the basics of any reporting and analytics. You can combine all sorts of queries, range filters, and bool queries with aggregations to get the most out of each query. Just take care of the few basics mentioned above to get started with your reporting and analytics.
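
To wrap up, here is a sketch of such a combination against the sample mapping above: documents per country for a single OS within a date range. The filter values are illustrative, and the request uses the same filtered-query style as the other examples in this post:

POST http://localhost:9200/sample-*/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "device_os": "Android" } },
            {
              "range": {
                "date_time": {
                  "gte": "2016-08-01 00:00:00",
                  "lte": "2016-08-31 23:59:59"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "country": {
      "terms": {
        "field": "country",
        "size": 5,
        "order": { "_count": "desc" }
      }
    }
  }
}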

As you dig deeper and deeper into reporting and analytics, you will face new challenges every day, but each challenge is a new lesson.

It would be my pleasure to get feedback through your comments and to answer your questions and suggestions. Till then, happy reporting :)