Needle in a haystack

Using Elasticsearch to help run CERN's Large Hadron Collider


(Disclaimer: This article reflects only the private opinion of Gergo Horanyi and does not represent the opinion of CERN, even though "we" is used several times. This writing was not supported by Elasticsearch BV or by anyone else. No confidential information about CERN is published in this article. CERN is large; thousands of people work together toward the same goal. This project is only a small link in the chain, nothing more.)


As one of the many young computer engineers at Europe's largest particle physics research center, I often find myself wondering how all of this could have been done. How did we manage to build and operate the largest particle accelerator on Earth?

The Large Hadron Collider (or LHC for short) boasts many amazing facts and figures. The circumference of the machine is 27 kilometers. It is not only the number of magnets that is high (to pick just one example, there are more than 1,200 dipole magnets in the LHC, each 15 meters long); we also have huge computing capacity (called the CERN Grid) to analyse the data coming from particle collisions in one of our 4 detectors. Not to mention the several thousand servers responsible for the smooth operation of CERN's accelerator complex.

https://www.youtube.com/watch?v=3wtUr3iVVIw

CERN Control Center (Maximilien Brice, CERN)

There is a special place at CERN, called the CERN Control Center (or CCC), which functions as the brain of CERN. The LHC and many other systems are controlled from here by only a handful of operators, who keep an eye on thousands of parameters 24/7.

Even though CERN has 60 years of experience, from time to time problems happen. Various protection systems ensure that if something goes wrong, all devices are driven to a safe state: safe for us, and safe for the devices themselves. While our high-energy beams are smashed into huge cylindrical carbon absorbers (shielded with hundreds of tons of concrete and iron), each and every system gets ready for diagnosis and restart.

The diagnosis part is where Elasticsearch comes into the picture. Our devices continuously send log messages about their state. We have many different solutions for this, which means we end up with many different logs, and that is never good. But this follows from the architecture and the age of our systems, and we cannot really change it.

When we start to diagnose problems, we have experts rushing into the CCC in the middle of the night. They have to, because we need to restart the device as soon as possible: first, because the machine is very expensive even while it is stopped, and second, because the physicists are impatient. They need data for their next Nobel Prize.

First of all, the expert has to figure out where the log is and how it can be accessed. Then the expert must understand the format of the logs. For this we might have some documentation, or we might not (yes, it happens). It can happen that the same data field in two different logs has completely different meanings. Sometimes this is easy to notice, but in other cases it is easy to misunderstand. As a result, we might end up with a wrong diagnosis, which delays the restart.

So, what did we do?


Well, we installed the ELK toolstack. And we are happy with it. We started to feed all available logs into one common system — not importing them once, but continuously transferring data to our new system. During the transfer we can convert the logs into a common, well-documented and unambiguous format. When everything is in one place, the experts can find what they need by simply searching one huge database. And this is what Elasticsearch gives us: a toolstack with a powerful indexing service at its heart.
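To make the "convert during transfer" idea concrete, here is a hypothetical sketch of such a Logstash pipeline (the port, grok pattern and field names are invented for illustration; they are not our actual configuration). It reads raw lines over TCP, parses them into named fields and ships them to Elasticsearch:

```
input {
  tcp {
    port => 5000
    type => "device-log"
  }
}

filter {
  # Parse a made-up "TIMESTAMP DEVICE SEVERITY message" line format
  grok {
    match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} %{WORD:device} %{LOGLEVEL:severity} %{GREEDYDATA:text}" ]
  }
  # Use the device's own timestamp instead of the arrival time
  date {
    match => [ "timestamp", "ISO8601" ]
  }
}

output {
  elasticsearch {
    host => "localhost"
  }
}
```

Every source gets its own filter section, but they all converge on the same field names, which is exactly what makes the result unambiguous for the experts.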

The goal of this article is to share our experiences. Even though your company most probably does not have a particle accelerator in the backyard, the systems behind one are similar enough that you may find our experiences useful. Below you will find a quick overview of our systems, some words on how ELK performed, and what difficulties we had.

Logstash, and other log inputs


Let's start at the bottom of the stack, with the data inputs. In general, Logstash has a surprisingly large number of modules already implemented (inputs, codecs, filters and outputs). These ship with the distribution, and their source is available online. (Did I mention that the entire toolstack is open source? Awesome!) When you want to use one of the existing modules, you just configure it in the configuration file and your job is done.

The problems start when you want to create your own inputs. Of course you can do that; tutorials are even provided. The problem is that it has to be done in JRuby. In our case, we have a huge Java codebase. In the environment I work in, everything is written in Java; I sleep and wake up with Java. We try to avoid creating any software that is not in Java, because our software system is so large that anything else would be impossible to maintain. Therefore, when we wanted to add some custom inputs to our system, we implemented them in Java, which meant partially reimplementing some of Logstash's functionality (various queues, the Elasticsearch output, filters, etc.). This was surprisingly easy, and we are now working on some lightweight JRuby wrappers so that we can contribute some of the general parts back to the community.
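To give an idea of what "reimplementing the Elasticsearch output" amounts to: at its core it is batching documents and POSTing them to the bulk REST endpoint in its newline-delimited format, one action line followed by one document line. The index name, type and fields below are made up for illustration:

```
POST /_bulk
{ "index": { "_index": "logs-2014.05.27", "_type": "device-log" } }
{ "@timestamp": "2014-05-27T03:12:45Z", "device": "EXAMPLE-DEV-1", "severity": "ERROR", "text": "illustrative message" }
{ "index": { "_index": "logs-2014.05.27", "_type": "device-log" } }
{ "@timestamp": "2014-05-27T03:12:46Z", "device": "EXAMPLE-DEV-2", "severity": "INFO", "text": "another illustrative message" }
```

Since this is plain HTTP and JSON, any language with an HTTP client — Java included — can act as a Logstash replacement for the output side.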

We found one other issue with Logstash (where our quickly created Java tool turned out to be much better): Logstash lacks metrics. Some time ago it had some, but they were removed to improve performance, and for now there are only plans to bring them back. Metrics are crucial for large infrastructures. We have monitoring tools that can alert us if something goes wrong — but for that, we need metrics. In the end, we wrote some shell scripts (executed by our faithful employee, m/monit) which warn us about problems with the data flow by periodically checking the amount of data in Elasticsearch.
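Our scripts are not public, but the idea can be sketched roughly like this (the host, index pattern and threshold are assumptions for the example, not our real configuration): one function asks the _count endpoint how many documents arrived recently, and another returns a non-zero exit code when the number is suspiciously low, so that m/monit or cron can raise an alert:

```shell
#!/bin/sh
# Hypothetical data-flow watchdog sketch.
ES_HOST="${ES_HOST:-localhost:9200}"
THRESHOLD="${THRESHOLD:-1000}"

recent_count() {
  # Ask Elasticsearch how many documents arrived in the last 10 minutes.
  curl -s "http://$ES_HOST/logstash-*/_count" -d '
    { "query": { "range": { "@timestamp": { "gte": "now-10m" } } } }' |
    sed 's/.*"count":\([0-9]*\).*/\1/'
}

check_flow() {
  # Exit non-zero when too few documents came in, so the monitor can alert.
  if [ "$1" -lt "$THRESHOLD" ]; then
    echo "WARNING: only $1 documents indexed in the last 10 minutes"
    return 1
  fi
  return 0
}
```

The monitoring tool then only has to run `check_flow "$(recent_count)"` periodically and react to the exit code.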

The core: Elasticsearch

Elasticsearch is the biggest beast in the toolstack. It provides the most functionality: it stores, indexes and searches data. Even so, we had the fewest problems with it. It is amazing how easy it is to use. Just watch one of the webinars on their website (they are free, too) and you will see how powerful it is. Even with the default configuration you have a single-node system running in minutes. If you spend two minutes on configuration, you have a redundant cluster. All right, before you put your cluster into production you should still do some more configuration, but there are good hints on the internet. Of course, if you have lots of data you need powerful machines — especially memory. There is no such thing as too much memory in your node (obviously kidding: you should not use more than about 30 GB as your ES_HEAP_SIZE, so that the JVM can keep using compressed object pointers).
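As an illustration of how little configuration a small redundant cluster needs — the cluster name, node names and hosts below are invented for the example, and the exact settings always depend on your environment:

```
# elasticsearch.yml (sketch for one node of a two-node cluster)
cluster.name: accelerator-logs        # identical on both nodes, so they join the same cluster
node.name: "log-node-1"
index.number_of_replicas: 1           # one replica: every shard lives on both nodes
discovery.zen.ping.unicast.hosts: ["log-node-1.example.org", "log-node-2.example.org"]
```

With one replica per shard, either node can disappear without losing data, which is the redundancy we describe below.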

You should spend some time designing your cluster(s). What do you need? Just data-storage redundancy? Do you need to serve a high search load? In our case, we have only two nodes. We replicate all the data, and we don't need much more, as we don't get that many search requests. What we really need is redundancy, to be sure that the data is always available.

Screenshot from Marvel (http://elasticsearch.org)

Of course we have a development environment for testing, and we have a separate monitoring "cluster" for Marvel. Marvel is the tool in the stack for monitoring ES clusters; it collects and visualizes everything you need to know about your cluster. This tool is not free, but for up to 5 nodes it costs only $500 a year, so it is really worth the price. Besides real-time data, you get fine-grained history, which helps you understand what is going on with your system.

A few words about scalability and performance. Our system receives roughly 1,000 documents per second — if everything goes fine. Last week we saw 22,000 messages a second. This is a huge number of documents, but Elasticsearch can deal with it (if you give the nodes enough memory). Search is fast enough; we don't really have performance issues. We don't actually store the data for very long: after 7 days we usually drop most of the documents, and some heavily duplicated messages we drop after just one day. We achieved this by organizing the indices properly, so that we can drop them easily from cron jobs via the REST API. Even so, we usually have more than 500 million documents in the system without any problems, which shows the potential of the tool.
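We cannot share our actual scripts, but the mechanism is simple once the indices are created per day. Assuming a hypothetical logs-YYYY.MM.DD naming scheme and a 7-day retention window (both invented for this example), a cron job only has to compute the name of the expired index and issue a DELETE against the REST API:

```shell
#!/bin/sh
# Sketch of time-based index retention; the "logs-" prefix and the
# 7-day window are illustrative, not our real scheme or policy.

old_index() {
  # Name of the daily index from $1 days ago (GNU date).
  echo "logs-$(date -u -d "-$1 days" +%Y.%m.%d)"
}

# A crontab entry could then drop everything older than a week, e.g.:
#   0 3 * * * curl -s -XDELETE "http://localhost:9200/$(old_index 7)"
```

Dropping a whole index this way is practically instantaneous, which is why per-day indices beat deleting individual documents for this kind of retention policy.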

And last, but not least, the top of the stack: Kibana

Kibana is made for presenting raw data from Elasticsearch in a pleasant way. There are various views, charts, diagrams and everything you need. Even the default dashboard is useful, but if you start customizing, you can really get some extra information out of it. Just a simple example: one of our systems sends the temperature of some low-level devices in its log messages, so now we can easily create dashboards showing the trends, which is sometimes useful. What we really liked about Kibana is that application developers can create their own dashboards and monitor their systems on their own, without any help from another team. We are still at the beginning with this, but we hope it will increase the quality of service we provide.

Kibana tries to be as user-friendly as possible, but it still has room to improve. Strange error messages appear from time to time, and the query/filter language (which actually comes from Elasticsearch, not from Kibana) is far from obvious — but overall it is nicely done.
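For the curious: the query language in question is the Lucene query-string syntax. A filter like the following (the field names are invented for the example) already mixes exact field matches, wildcards and a date-math range, which is where newcomers tend to stumble:

```
severity:ERROR AND device:EXAMPLE-* AND @timestamp:[now-1h TO now]
```

Once you have seen a few of these, building Kibana dashboards gets much faster.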

This is the third open-source tool in the toolstack, which is a great advantage again. We wanted to contribute some minor extensions to one of the panels, and we could do it easily, in one day (we added a set of standard or default fields to avoid showing hundreds of fields in the field explorer of the table panel; this will soon be shared on GitHub). So, in general, Kibana is well done and usable by non-experts; we really recommend it, and we are looking forward to the newer versions.

Closing thoughts and ideas

Altogether, the ELK toolstack consists of three tools (plus one with Marvel) that are really handy in many situations. It is very easy to start with, the first results show up almost immediately, and it gets even better as time passes.

Besides using the tools in the toolstack, you can easily build applications on top of Elasticsearch (see their case studies), which we have already started to do.
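Building on top of Elasticsearch mostly means talking to its REST API directly. As a hypothetical example (the index pattern, field names and search terms are invented for illustration), an application that looks for recent errors mentioning a particular device family could issue a search like this:

```
POST /logs-*/_search
{
  "query": {
    "filtered": {
      "query":  { "match": { "text": "power converter" } },
      "filter": { "range": { "@timestamp": { "gte": "now-24h" } } }
    }
  },
  "size": 20
}
```

Because it is just HTTP and JSON, such an application can live in any language — in our case, naturally, Java.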

We can see that the company behind it (Elasticsearch BV) is flourishing, the community is vivid and, most importantly, the product is awesome.

Of course the product is not complete, but it evolves quickly. In our case, we would have profited from some kind of life-cycle management tool — mostly for the ES nodes and indices, but also for Logstash. For example, it was a bit tiresome to log in to the machine, edit the configuration file and then restart the entire Logstash instance just because one character in a filter had to be changed. It would also be nice to manage nodes (not just view information about them) from a central management tool, and to have some tooling for automatically deleting or moving indices, so that we would not have to create our own scripts for that.

Did I mention the great community behind ES? ☺ @jlecour just pointed me to curator from Elasticsearch, which solves many of the things we missed. Thanks!

But these are just minor things. If you want to remember only one sentence after reading this rather long article, let it be the following:

“The ELK toolstack is cool, we should have a look.”


If you have any comments, please leave a note. I would really appreciate it, as this is my first piece of writing here on Medium. If you are really interested, come and take part in CERN's mission ☺.