Understanding Humio’s Data Sources

I just had a great question on our Slack channel today, and thought I’d write up a small blog on how to deal with ingest performance.

When data arrives in Humio, it comes associated with a number of tags. These are key-value pairs of the form #tag=value, which are used to guide and optimize how data is stored and searched; as such, your use and management of tags impacts both ingest speed and query performance.

You can specify tags in, for example, your Filebeat configuration or when you create an endpoint to receive data on, and tags can also be created based on the contents of the log line.

Ingest & Search

As data arrives at Humio, it is internally split into ‘data sources’. A data source is identified by a unique combination of tags. So e.g. #type=accesslog,#host=prod01 is a separate data source from #type=accesslog,#host=prod02.
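
To make the identity rule concrete, here is a toy sketch (not Humio’s actual code) of how a unique tag combination could map to a data source key; the function name is hypothetical:

```python
def datasource_key(tags):
    # Sort tags so the same combination always yields the same key,
    # regardless of the order in which the tags were specified.
    return ",".join(f"#{k}={v}" for k, v in sorted(tags.items()))

a = datasource_key({"type": "accesslog", "host": "prod01"})
b = datasource_key({"host": "prod01", "type": "accesslog"})
c = datasource_key({"type": "accesslog", "host": "prod02"})
```

Here `a` and `b` are the same data source, while `c` (a different host) is a separate one.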

When issuing a query that contains tags, say #type=accesslog, Humio will only search the data sources that match this tag. As such, tags are paramount for query performance because they let you limit what data to scan through.
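
The pruning step can be sketched as a filter over data sources; this is an illustrative toy model, not Humio’s query planner:

```python
# Tag-based pruning: a query that specifies tags only needs to scan
# the data sources whose tags match every query tag.
def matching_datasources(datasources, query_tags):
    return [ds for ds in datasources
            if all(ds.get(k) == v for k, v in query_tags.items())]

datasources = [
    {"type": "accesslog", "host": "prod01"},
    {"type": "accesslog", "host": "prod02"},
    {"type": "syslog", "host": "prod01"},
]
hits = matching_datasources(datasources, {"type": "accesslog"})
```

Only the two accesslog data sources would then be scanned; the syslog one is skipped entirely.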

Each data source’s ingest is processed sequentially, on one machine in your Humio cluster. For each data source, Humio collects logs in a work-in-progress buffer of ~1MB. Whenever that WIP buffer is full, we compress it and flush to disk, appending the block to a segment file. Each data source has its own directory, containing the segment files that make up its log data.
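
The buffer-and-flush cycle can be sketched like this; it is a simplified model under the ~1MB figure from the text, not Humio’s implementation (the compression codec here is just illustrative):

```python
import zlib

WIP_LIMIT = 1 << 20  # ~1 MB work-in-progress buffer per data source

class DataSource:
    """Toy model of sequential per-data-source ingest."""
    def __init__(self):
        self.wip = bytearray()
        self.segment_blocks = []  # stands in for blocks appended to a segment file

    def ingest(self, event: bytes):
        self.wip += event
        if len(self.wip) >= WIP_LIMIT:
            self.flush()

    def flush(self):
        if self.wip:
            # Compress the full buffer and append it as one block.
            self.segment_blocks.append(zlib.compress(bytes(self.wip)))
            self.wip.clear()

ds = DataSource()
for _ in range(1100):
    ds.ingest(b"x" * 1024)  # ~1 KB events
```

After 1100 such events, one full block has been flushed and the remainder sits in the WIP buffer awaiting the next flush.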

All of this is nice and simple, and it keeps the implementation of Humio relatively simple and efficient, which is good for many reasons.

However, there are three situations that pose challenges:

  • If a data source receives more data than it can process sequentially (typically around 10,000 events/sec). If we didn’t do anything about this, Humio would not be able to keep up with the ingest load.
  • If a data source is too slow, it creates very small blocks, which are inefficient for I/O purposes.
  • If there are too many distinct combinations of tags, we get too many data sources, using too many 1MB WIP buffers and too many directories, which may push the limits of the file system.

The following sections describe how we deal with these situations.

High Velocity Data Sources

Dealing with high-velocity data sources is fortunately relatively simple: create more data sources. We do this by monitoring the ingest load for each data source, and if it reaches a high water mark, we simply split it by adding a random tag such as #humioAutoShard=0 or #humioAutoShard=1, thereby artificially splitting the data source in two and increasing ingest capacity by parallelizing the ingest processing. This automatically doubles up to 128 distinct auto shards, currently limiting us to ~1M events/sec on a single data source.
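
A minimal sketch of the splitting idea, assuming a shard is picked at random per event (the function name is hypothetical; only the #humioAutoShard tag and the 128-shard limit come from the text):

```python
import random

MAX_AUTOSHARDS = 128  # current limit mentioned above

def assign_autoshard(tags, shard_count):
    """Split a hot data source by adding a random #humioAutoShard tag,
    spreading ingest across shard_count tag combinations."""
    if shard_count > 1:
        tags = dict(tags, humioAutoShard=random.randrange(shard_count))
    return tags

tags = {"type": "accesslog", "host": "prod01"}
sharded = assign_autoshard(tags, 2)
```

Because the added tag changes the tag combination, each shard becomes its own data source and gets its own sequential ingest pipeline.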

You can review the currently assigned autosharding by issuing a GET to /api/v1/dataspaces/$DATASPACE/datasources/$DATASOURCE/autosharding. You can see the data source IDs in the UI under the data space settings; each is a long UUID.

Edit: Since writing this we have implemented automatic reduction of the autosharding level as ingest velocity becomes lower. See release notes for version 1.1.7.

Low Velocity Data Sources

If data sources are slow, they also create inefficiency. Every 10 minutes we flush WIP buffers, and if a buffer holds significantly less than 1MB, searching and I/O will operate at sub-par performance. We flush buffers periodically because we want to avoid losing the WIP in case of a crash, which would mean reading it again from the Kafka queue we use to buffer ingest. So, in the worst case, we have to re-read 10 minutes of ingest from Kafka on restart.

In order to deal with small block sizes, we periodically merge small segment files into larger ones. This is relatively simple: just re-read the events, artificially ingest them again, and once done we can discard the old segment files.
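
The merge policy can be sketched roughly like this, treating each segment as a list of events and using event count as a stand-in for byte size (both the threshold and the function are illustrative, not Humio’s actual values):

```python
def merge_segments(segments, min_size):
    """Re-read the events of undersized segments and combine them
    into one larger segment; the old small segments are discarded."""
    small = [s for s in segments if len(s) < min_size]
    large = [s for s in segments if len(s) >= min_size]
    if len(small) > 1:
        merged = [event for seg in small for event in seg]  # "re-ingest" events
        return large + [merged]
    return segments

result = merge_segments([[1, 2], [3], [4, 5, 6, 7, 8]], min_size=3)
```

The two undersized segments are folded into a single one, while the already-large segment is left untouched.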

High Variability Tags

Another problem is caused by having too many data sources, which happens if some of the tags have high variability. Say you have chosen to make the IP address of the source host a tag, such as #ip=123.456.78.90. This can easily overload Humio with too many WIP buffers (which are memory hungry) and too many directories (limitations in the operating system), and put an uncomfortable load on the servers. For these reasons, we currently reject ingest that would create more than 10,000 data sources per data space.

But there is also a remedy for this: you can set up rules that take certain tags and hash them into a lower-variability value domain. We call this tag grouping, as documented here.

What tag grouping does is take the hash value of the incoming tag modulo a size that you decide. Say you want the #ip tag to be bucketed into just 15 groups; you would specify {"field": "ip", "modulus": 15} in your grouping rule. Given such a rule, the #ip tag would only contribute to 15 distinct data sources (but other tags may still make this number grow).
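
A minimal sketch of the bucketing step, assuming a CRC32 hash purely for illustration (the hash function Humio actually uses is not specified here):

```python
import zlib

def group_tag(value: str, modulus: int) -> int:
    """Bucket a high-variability tag value into one of `modulus` groups,
    as a tag-grouping rule like {"field": "ip", "modulus": 15} would."""
    return zlib.crc32(value.encode()) % modulus

# Any number of distinct IPs collapses into at most 15 buckets.
buckets = {group_tag(f"10.0.0.{i}", 15) for i in range(100)}
```

The same value always lands in the same bucket, which is what lets queries on an exact tag value still prune data sources.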

The value of your tag is then stored in each individual event rather than just in the data source configuration itself, increasing storage requirements a bit but reducing resource consumption on WIP buffers and filesystem load. When a search happens, say #ip=123.456.78.90, you would be scanning through all the data for which the hash of the #ip tag has the same value, selecting just those events where the value actually matches. The user making the query doesn’t see the difference, except that the query optimizer now cannot be smart with searches like #ip=123.456.* because we cannot say which hash values would be produced by values matching that pattern. So if you issue such a query, all data with an #ip tag would be searched.
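
The two-step search over grouped tags can be sketched as follows; this is a toy model (hash choice, data layout, and function names are all illustrative), but it shows the scan-the-bucket-then-filter behavior described above:

```python
import zlib

def bucket_of(value: str, modulus: int) -> int:
    # Hypothetical hash; the function Humio actually uses is not specified here.
    return zlib.crc32(value.encode()) % modulus

def search_grouped(datasources, tag_value, modulus):
    """Scan only the bucket that tag_value hashes to, then filter events
    on the actual tag value stored inside each event."""
    bucket = bucket_of(tag_value, modulus)
    return [ev for ev in datasources.get(bucket, [])
            if ev.get("ip") == tag_value]

# Ingest side: each event lands in the data source for its value's bucket,
# and the full tag value is kept on the event itself.
datasources = {}
for ev in [{"ip": "10.0.0.1", "msg": "a"}, {"ip": "10.0.0.2", "msg": "b"}]:
    datasources.setdefault(bucket_of(ev["ip"], 15), []).append(ev)

hits = search_grouped(datasources, "10.0.0.1", 15)
```

Even if two IPs happen to share a bucket, the per-event filter still returns only the exact matches, which is why the grouping is invisible to the user for exact-value queries.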

Edit: As of release 1.1.7, we have implemented automatic enabling of tag grouping for tags where Humio observes more than 100 distinct values; such tags will be set to 32 groups.

I hope this sheds some light on how we work hard to make high-performance ingest and search work for you.