Elastic{ON} 16

Thilo Haas
smartive
4 min read · Mar 4, 2017


Last month we had the opportunity to visit the Elastic{ON} in San Francisco. Our goal was to deepen our knowledge of the Elastic Stack, since we already used Elasticsearch and Kibana extensively for several client projects.

Disclaimer: This post was originally written and published on February 29th, 2016 and has been migrated to Medium.


Big things are coming with the release of Elastic Stack version 5.0! We got a peek at a huge number of new features in Elasticsearch, Logstash, Kibana and Beats, and dived deep into the technology behind the stack. In the following, I will point out some of our learnings and share a few insights.

Elasticsearch

There are many different use cases for Elasticsearch beyond simple log analysis or full text search, and many of them were presented during the conference. For example, Eventbrite showed how they use the power of Elasticsearch to improve their event listings with personalized, recommendation-based search.

In all those fields the Elastic team did great work in further improving Elasticsearch. Here are just some of the key features and improvements:

  • Data/log collection: There will be an ingest node that allows sending data directly to Elasticsearch, making it possible to get rid of Logstash and simplify the stack setup for log collection.
  • Full text search: To improve full text search, TF/IDF will be replaced with Apache Lucene’s BM25 scoring by default. This should increase the relevancy of search results. Note that with BM25, shorter fields (e.g. title, tags, …) might not be weighted as highly as with TF/IDF, so you might need to increase the boosting of short fields depending on your use case.
  • Big data analytics and recommendations: Upcoming releases of the Elasticsearch-Hadoop plugin will support even deeper integrations with Hadoop, such as ES on YARN working with HDFS, and further improvements such as aggregation support are on the roadmap.
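To make the ingest node point concrete, here is a minimal sketch of what such a pipeline could look like. The pipeline name, log format and grok pattern are made up for illustration; the REST calls are shown only as comments:

```python
import json

# Sketch of an Elasticsearch 5.x ingest pipeline that parses access-log-style
# lines on the ingest node itself, replacing a separate Logstash parsing step.
access_log_pipeline = {
    "description": "Parse access logs on the ingest node",
    "processors": [
        {
            "grok": {
                "field": "message",
                "patterns": [
                    "%{IPORHOST:client} %{WORD:method} "
                    "%{URIPATHPARAM:path} %{NUMBER:status:int}"
                ],
            }
        },
        {
            # Drop the raw line once it has been parsed into fields.
            "remove": {"field": "message"}
        },
    ],
}

# Registering the pipeline and indexing through it would look roughly like:
#   PUT  _ingest/pipeline/access-logs       (body: access_log_pipeline)
#   POST logs/access/1?pipeline=access-logs (body: {"message": "..."})
print(json.dumps(access_log_pipeline, indent=2))
```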

There are many more features coming in v5.0, including very interesting improvements for geospatial data types and the introduction of a graph API; just have a look at the GitHub issues.

Logstash

Logstash will receive some improvements concerning the persistence of its queues. Until then, make sure to use a persistent queue like Redis if you must not lose your data: if the Logstash daemon gets killed, all messages within its in-memory queue are lost.
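A common way to do this today is to split Logstash into a shipper and an indexer with Redis as the broker in between. A minimal sketch of the indexer side (host names are placeholders):

```
# Indexer: read events from a Redis list, which survives a Logstash crash,
# and forward them to Elasticsearch.
input {
  redis {
    host      => "redis.example.com"
    data_type => "list"
    key       => "logstash"
  }
}
output {
  elasticsearch {
    hosts => ["es.example.com:9200"]
  }
}
```

The shipper side mirrors this, reading from its inputs and writing to the same Redis list.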

Beats

The introduction of Beats as lightweight data shippers, in combination with the Elasticsearch ingest node, has the potential to redefine the way we do logging. Beats are written in Go, and custom Beats can easily be created thanks to the underlying libbeat library and the various example Beats.

Several Beats already exist for interesting use cases.

As a proof of concept, our buehler implemented twitterbeat, which continuously fetches all tweets from given Twitter screen names. Feel free to give it a try and send us some feedback!

Insights

There were many interesting talks, all of which are available online. One in particular I would like to point out:
Andrew Montalenti from Parse.ly explained how they set up Elasticsearch to handle millions of real-time log analytics events every day.

Time based indexes for real time log analytics

Andrew Montalenti talked about how they set up Elasticsearch to handle 20 TB of new data every month while still achieving real-time analytics. They log every view and click on news and blog articles across different web pages.

There are two key points to achieve this:

  • Cumulatively merge statistics within a time period
  • Create separate Elasticsearch indexes for each time period and allocate them to hot/warm/cold machines

At the first stage they log every single page view and click on their sites as a raw event and index each one as a separate document in Elasticsearch.
Every 5 minutes, a background job aggregates all views and clicks, grouped per page, and writes a rollup document to a separate 5-minute Elasticsearch index containing the total number of views, clicks and users within that time period. Every 24 hours, these 5-minute summaries are again aggregated into a 1-day rollup. This reduces disk space and speeds up the calculations behind their analytics.

To further simplify the merging, all Elasticsearch documents use the same schema, irrespective of whether they are rollup entries or raw events.

Elastic{ON} 16 — Parse.ly

To boost the performance of their Elasticsearch indexes, they allocate the indexes to different machines according to their usage.
There are four types of machines:

  • Raw: mainly CPU used for raw entries of every single page view / click
  • Hot: mainly memory and CPU used for recent 5-minute aggregated documents
  • Warm: mainly memory and SSD used for older 5-minute aggregated documents
  • Cold: HDD used for 1-day aggregated documents
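This kind of allocation maps naturally onto Elasticsearch's shard allocation filtering: nodes are started with a custom attribute describing their hardware class, and each index requires that attribute in its settings. A minimal sketch, assuming nodes are tagged with an attribute we chose to call `box_type` (the attribute name is our choice, not fixed by Elasticsearch):

```python
import json

def allocation_settings(box_type):
    """Index settings that pin an index's shards to nodes of one box type.

    Assumes each node was started with a matching attribute, e.g.
    `node.attr.box_type: hot` on the hot machines.
    """
    return {"settings": {"index.routing.allocation.require.box_type": box_type}}

# A new 5-minute rollup index starts on the hot machines ...
print(json.dumps(allocation_settings("hot")))
# ... and as it ages, updating the index settings with the "warm" (and
# later "cold") variant migrates its shards to slower hardware.
print(json.dumps(allocation_settings("warm")))
```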

Thanks

Thank you, Elastic, for an awesome conference! It was a really nice get-together of folks working with the Elastic Stack, sharing their insights and giving a heads-up about the roadmap. We are now even more confident that the Elastic Stack is the cream of the crop for our data analytics and search use cases.

If you have any questions, feel free to contact us. We’re always looking for new challenging projects!
