The anatomy of a Druid segment file
At Optimizely, we’ve been developing a counting service based around Druid for a couple of months (see our previous post to learn why). In this post, we’ll go into some technical details behind Druid’s storage format.
Druid stores its data in ‘segment files’, and if you use Druid in production you’ll find yourself optimizing these in various ways to match your query pattern: ensuring they’re the optimal size, that they’re partitioned suitably, and that they’re replicated and balanced properly across your query-facing nodes to prevent any subset of nodes from getting too ‘hot.’
But segment files are more than just something you’ve got to configure properly — they’re also the key to Druid’s extremely compact representation of data. While Druid’s documentation has improved, some of its features and inner-workings are still rather mysterious unless you’re willing to dig through the forums or excellent white paper. So at the last Druid meetup, I took the opportunity to ask the core contributors to outline the anatomy of a segment file, which I’ll explain in the following example.
Update: I have contributed the meat of this post to the official druid documentation, so you can also find most of this information here.
Example: Wikipedia Edits
We’ll stick to the Wikipedia edits example covered in the druid white paper and the tutorials. Let’s imagine we want to store the following data:
Each column is stored separately within the segment file, which means that Druid can save time by accessing only those columns which are actually necessary to process a query.
Timestamps, dimensions, and metrics are the three basic column types in Druid. The timestamp and metric columns are ‘simple’ and so we won’t spend much time describing them: behind the scenes each of these is an array of integer or floating point values compressed with LZ4. Once a query knows which rows it needs to select, it simply decompresses these, pulls out the relevant rows, and applies the desired aggregation operator. As with all columns, if a query doesn’t require a column, then that column’s data is just skipped over.
Dimensions are what you’re allowed to filter or group by on, so they require a more complex, indexed structure. For each dimension column, druid contains the following three data structures.
- A dictionary that maps values (which are always treated as strings) to integer IDs
- A list of the column’s values, encoded using the dictionary in 1.
- For each distinct value in the column, a bitmap that indicates which rows contain that value.
Consider the ‘page’ column from the example data above, pictured left. Note that the bitmap is different from the first two data structures: whereas the first two grow linearly in the size of the data (in the worst case), the size of the bitmap section is the product of data size * column cardinality. Compression will help us here though because we know that each row will have only a single value of 1 in each column, which means that high cardinality columns will have extremely sparse, and therefore highly compressible, bitmaps. Druid exploits this using compression algorithms that are specially suited for bitmaps, such as roaring bitmap compression.
Hidden killer feature: multi-value columns
After explaining these points to me, one of the core contributors off-handedly mentioned that Druid also supports multi-value columns, something which I consider to be a killer feature, and will certainly come in handy in our use case here at Optimizely. This came as a surprise to me, because multi-value columns are not introduced anywhere in the Druid documentation.
To understand this feature, consider the case where you simply want to tag Wikipedia pages with topics multiple topics — thus, a single page might be tagged with both ‘Justin Bieber’ and ‘Ke$ha’. Without multi-value columns, you’d have to create two nearly-identical rows in Druid, and possibly more so that you can query the number of pages without double-counting.
With multi-value columns, tagging is natural and efficient — a row’s value for a column can simply be an array of tags, and will be represented in the segment file as illustrated to the left.
The query semantics of multi-value columns are by and large what you’d want and expect: an OR query will return return a record if any of its multiple values evaluate to true. It’s also possible execute an AND query with multiple conditions on the same column and receive a return non-empty result set — e.g., if we queried for rows where page=`Justin Bieber’ AND page=`Kesha’, we’d receive the highlighted row in the result set.
The only query that performs somewhat unexpectedly is group-by. The result set of a query on the ‘page’ column in our example would include a count for each unique page. If a row has n vales in the page dimension and we run a group-by on the page dimension, then the result set will count the row n times. That’s because the resulting counts are aggregated by unique value, and the row will increment each of it’s value’s counts. This behavior is probably what you want, but you have to be mindful that summing the counts from each grouped-by value can lead to over-counting the number of rows.
Druid segment files are not quite as simple as that. They also contain a meta-data field that describes the version of the file so that the Druid devs can update the segment file spec and still allow backwards compatibility. Furthermore, the data part of the files is actually segmented into 64k chunks that are compressed separately.
PS If you’re interested in this type of work, we’re hiring.