ELK Stack 101
Delivering useful software that your customers value is not an easy task. Beyond the technical skills required, you have to have insight into how customers use your product and anticipate their needs. The Elasticsearch-Logstash-Kibana (ELK) technology stack can help you deliver better products, better understand the needs of your customers, and give you an advantage over your competitors by helping you gain new insights into your data.
ELK is an end-to-end stack that delivers actionable insights in real time from almost any type of structured or unstructured data source. In this post, I’ll help you discover how easy it is to create a minimal setup and configuration of the ELK stack.
- Elasticsearch is a distributed, scalable, and highly available real-time search and analytics engine with full-text search, built on top of Lucene. It is distributed by default and it is all JSON: put a JSON document in, and get a JSON document back.
- Logstash centralizes the processing of data of all types, can be quickly extended to custom log formats, and supports plugins for custom data sources.
- Kibana is a flexible analytics and visualization platform which offers real-time summary and charting of streaming data.
These three tools from the ELK stack work well together because they are developed and maintained by the same company, Elastic. All components of this stack are open source, and you can use them for free in all of your projects. To install them, you will find all the needed instructions in their GitHub repositories.
Why use ELK?
The ELK stack creates a “pipeline” that lets you control every step of data processing. For example, let’s say you want to log some server actions, such as connections to the server, the time of each connection, and the connection type. You use a Logstash agent to take in the input data and ship it to Elasticsearch. The data goes into Logstash as JSON and is stored in Elasticsearch the same way. You can query Elasticsearch directly or through an API, or you can simply monitor all statistics in Kibana dashboards.
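To make the example concrete, here is a minimal sketch of what one such connection event could look like as JSON. The field names (`client_ip`, `connection_type`) and the helper `make_connection_event` are hypothetical and chosen only for illustration; `@timestamp` follows the naming Logstash commonly uses for event times.

```python
import json
from datetime import datetime, timezone


def make_connection_event(client_ip, connection_type, connected_at):
    """Build a JSON-serializable log event for one server connection.

    The field names are illustrative assumptions, not a required schema.
    """
    return {
        "@timestamp": connected_at.isoformat(),
        "client_ip": client_ip,
        "connection_type": connection_type,
    }


event = make_connection_event(
    "203.0.113.7", "ssh", datetime(2015, 6, 1, 12, 30, tzinfo=timezone.utc)
)

# Serialize the event the way it would be handed to the pipeline.
line = json.dumps(event)

# JSON in, the same JSON back out.
assert json.loads(line) == event
print(line)
```

A document shaped like this is what would travel through Logstash and end up stored in an Elasticsearch index.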
How the pipeline works:
Let’s take the above pipeline and get to know each tool’s basic features.
Logstash runs as a service on the server machine, but it also connects to clients, called agents, that run on other machines, thus using a client-server architecture to centralize the information gathered from data files. The agents are defined on the server machine using .conf files. There are two ways to make your .conf files the default configuration for the Logstash service:
(1) copy your .conf files into /etc/logstash/conf.d and restart the Logstash service with the following command:
$ sudo service logstash restart
(2) open /etc/init.d/logstash and change the default configuration directory for the Logstash service. To do that, find the lines that set CONF_DIR and change the path assigned to it. After changing the default path and saving the file, restart the Logstash service and you are ready to go.
The simplest code example for a Logstash agent:
For the Logstash input you can choose from multiple options, such as file, tcp, udp, or twitter. You can see all available input plugins here, or with these input code examples:
Because we are focusing on the ELK stack, the output code examples focus on the Elasticsearch output. You can see some more output plugins here, or with these output code examples:
There are many more configuration options supported by the Elasticsearch output plugin, and you can use them in your projects. You can see all available configuration options here.
Now that we have set up an input and an output for our Logstash agent, we should normalize the data that comes into the pipeline, so that we eliminate fields that are useless. To accomplish this we set some filters.
Filter code examples:
# set matching data filters for logstash agent
You can see some more filter plugins here.
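As a rough illustration of what the filter stage does conceptually, the sketch below removes unwanted fields from an event before it is stored. This is plain Python, not Logstash configuration syntax, and the function `drop_fields` and the sample field names are hypothetical.

```python
def drop_fields(event, unwanted):
    """Return a copy of the event without the fields listed in `unwanted`."""
    return {key: value for key, value in event.items() if key not in unwanted}


raw_event = {
    "@timestamp": "2015-06-01T12:30:00+00:00",
    "client_ip": "203.0.113.7",
    "connection_type": "ssh",
    "pid": 4242,            # noise we do not want to index
    "debug_blob": "xyz",    # noise we do not want to index
}

clean_event = drop_fields(raw_event, {"pid", "debug_blob"})
print(clean_event)
```

In a real agent, a filter plugin performs this kind of normalization inside the .conf file, so only the useful fields reach Elasticsearch.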
Elasticsearch runs as a service on your machine/server and its configuration is defined in elasticsearch.yml. You can find the entire Elasticsearch directory layout here. We will try to cover most of the components of Elasticsearch, so as to better understand how it may fit into our pipeline.
Elasticsearch is a powerful search engine, but using it alongside Logstash and Kibana in the ELK stack simplifies scaling questions like “how many shards do I need?”, “do I have to use more indices or keep all the data stored under the same index?”, and many others.
All data is stored under one or more indices. The Logstash agent controls the number of indices automatically; by default it creates one index per day. This lets you use tools like Curator to easily expire old data.
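The sketch below shows why one index per day makes retention simple: the index name encodes the date, so old data can be selected by name alone. It assumes the `logstash-YYYY.MM.dd` naming pattern, which is the default I am aware of for the Elasticsearch output; the helper names and the retention logic are hypothetical, and a real tool such as Curator does considerably more.

```python
from datetime import date, timedelta


def daily_index_name(day, prefix="logstash"):
    """Name of the index that holds events for the given day.

    Assumes the `<prefix>-YYYY.MM.dd` pattern.
    """
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_older_than(index_days, today, keep_days):
    """Return the names of daily indices older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return [daily_index_name(day) for day in sorted(index_days) if day < cutoff]


days_with_data = [date(2015, 5, 1), date(2015, 5, 30), date(2015, 6, 1)]
expired = indices_older_than(days_with_data, today=date(2015, 6, 1), keep_days=7)
print(expired)
```

Dropping a whole expired index is generally much cheaper than deleting individual documents from one large index.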
Using a web browser or a terminal on your machine, you can list all indices, create, modify, or delete indices, check index status, and inspect index settings, mappings, and health. All the commands, with examples and detailed explanations, can be found in the Elasticsearch docs.
To get a clearer picture of your indices and to organize them quickly and easily, I suggest using the head plugin for Elasticsearch, which will save you a lot of work. You can find more Elasticsearch plugins here.
Querying the Data
Once the pipeline is created and the Logstash agent is sending data into Elasticsearch indices, all you need to do is search through your data. Elasticsearch has client libraries for many programming languages, such as Python (elasticsearch-py or elasticsearch-dsl-py), Java, etc. You can also query your Elasticsearch data using a web browser or a terminal, directly from the cluster indices.
$ curl -XGET 'http://localhost:9200/_search?q=tag:wow'
This command searches across all indices and all types. The URL above says: connect to the host "localhost" on port 9200 (the port that Elasticsearch exposes) and run the query "tag:wow". With Elasticsearch you can build a wide variety of powerful queries that integrate features like aggregations, which let you work with buckets and generate date histograms and even geo distance correlations. You can find all available search features in the search section of the Elasticsearch documentation.
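The same request can be assembled programmatically. The sketch below only builds the URL and a request body; it does not contact a cluster. The helper names are hypothetical, and the aggregation body is an assumption about the query DSL: older Elasticsearch versions accept `interval` in a `date_histogram`, while newer ones expect `calendar_interval` or `fixed_interval`, so check the documentation for your version.

```python
import json
from urllib.parse import urlencode


def search_url(query, host="localhost", port=9200, index=None):
    """Build a URI search URL; with no index, all indices are searched."""
    path = f"/{index}/_search" if index else "/_search"
    params = urlencode({"q": query})  # percent-encodes the colon in "tag:wow"
    return f"http://{host}:{port}{path}?{params}"


def date_histogram_body(field="@timestamp", interval="day"):
    """Request body that buckets matching documents by date.

    The `interval` key is an assumption; see the note above.
    """
    return {
        "size": 0,  # return only the buckets, not the matching documents
        "aggs": {
            "events_over_time": {
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }


print(search_url("tag:wow"))
print(json.dumps(date_histogram_body(), indent=2))
```

The printed URL is equivalent to the curl command above, with the colon percent-encoded.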
All optimization settings mentioned below must be added to the config/elasticsearch.yml file. These optimizations will help your cluster achieve better search performance and improve its scalability.
- Limiting memory usage
The JVM heap is a limited resource that should be used wisely. Limiting the impact of fielddata on heap usage reduces heap pressure, which can otherwise cause node instability (thanks to slow garbage collections) or even node death (with an OutOfMemory exception).
When choosing the heap size with the $ES_HEAP_SIZE environment variable, respect two important rules:
- No more than 50% of available RAM
- No more than 32 GB (if the heap is less than 32 GB, the JVM can use compressed pointers, which saves a lot of memory: 4 bytes per pointer instead of 8 bytes)
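The two rules combine into a simple calculation, sketched below. The function name and the `max_heap_gb` parameter are hypothetical; to stay safely below the compressed-pointer threshold you could pass a slightly lower cap, for example 31.

```python
def recommended_heap_gb(total_ram_gb, max_heap_gb=32.0):
    """Apply both rules: at most half of RAM, and never above the cap."""
    return min(total_ram_gb * 0.5, max_heap_gb)


for ram in (8, 16, 64, 128):
    print(f"{ram} GB RAM -> {recommended_heap_gb(ram)} GB heap")
```

On machines with 64 GB of RAM or more, the cap is the binding rule; the remaining memory is left to the operating system's file cache.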
- Fielddata size
Elasticsearch does not load into fielddata only the values for documents that match your query. It loads the values for all documents in your index, because it is cheaper to search through preloaded data. When you query a large index, or many indices at once, fielddata can grow to fill almost the entire heap, and your cluster will get stuck.
Knowing that, you may want fielddata to behave much like context switching for processes in an operating system. When you run a query against Elasticsearch, it loads the values into memory and then tries to add them to fielddata. If the resulting fielddata size would exceed the specified size, other values are evicted to make space.
By default, this setting is unbounded. To control how much space is allocated to fielddata, set a size limit for it in the configuration file.
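The eviction behavior described above can be sketched as a size-bounded cache. This is a simplified model, not Elasticsearch's implementation: the class `BoundedFieldCache` is hypothetical, and evicting the least recently used entry first is an assumption about the policy.

```python
from collections import OrderedDict


class BoundedFieldCache:
    """Toy model of a fielddata cache with a fixed byte budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # field name -> size in bytes, oldest first

    def load(self, field, size_bytes):
        """Load a field's values; return the names of any evicted fields."""
        if field in self._entries:
            # Already cached: just mark it as most recently used.
            self._entries.move_to_end(field)
            return []
        evicted = []
        # Evict least recently used entries until the new one fits.
        while self._entries and self.used_bytes + size_bytes > self.max_bytes:
            old_field, old_size = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_field)
        self._entries[field] = size_bytes
        self.used_bytes += size_bytes
        return evicted


cache = BoundedFieldCache(max_bytes=100)
cache.load("tag", 40)
cache.load("user", 40)
cache.load("tag", 40)             # touch "tag" so "user" becomes the oldest
print(cache.load("country", 40))  # "user" is evicted to make room
```

With an unbounded cache, the loop never runs and memory use only grows, which is the failure mode the size limit is meant to prevent.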
- Index refresh interval
The variable index.refresh_interval sets the async refresh interval of a shard. The right setting for this variable can provide a proper balance of operational safety and load coalescence because of the transaction log flushing threshold.
To control the refresh interval of a shard, set index.refresh_interval in your configuration file.
More details about scaling Elasticsearch clusters can be found in the Elasticsearch docs. Happy scaling!
Kibana is an HTML5 application that runs as a service on your machine/server and shows statistics about data stored in the Elasticsearch cluster. Kibana uses port 5601 and offers a user-friendly web interface that enables you to create and share dynamic dashboards that display the results of Elasticsearch queries in real time.
Once you have Kibana installed on your machine/server, the only thing you have to do is open http://localhost:5601 and you are good to go.
The ELK stack is a powerful stack that helps you retrieve, store, and show data of all types aggregated from multiple platforms, using a scalable system. It can help you visualize data, make decisions, and even understand the needs of your users or discover things you did not know about your services :-) (like security holes).
- Elastic Learn
- What is ELK and how can it help you discover, visualize and analyze your data?
- Elasticsearch from the bottom up
About the Author
Andrei is a former Software Engineering Co-op at the Hootsuite Analytics backend team in Bucharest, Romania. He is passionate about product management and start-ups, trying every day to know more than the day before. Keep learning, keep busy.
You can follow him on Twitter @vaduva_andrei.