Roaring Bitmaps : fast data structure for inverted indexes
This article describes at very high level , how roaring bitmaps work. Its not meant to be detailed and complete. Contains just enough information to understands basic technicalities.
What are inverted indexes?
Inverted indexes are used to provide fast element search on fairly large data sets. Indexes are numeric values to indicate location inside an data structure. Eg: array [11,22,33,22,44]. So inverted index is an mapping like
11- 0 & 22-1,3 & 33-2 & 44-4
Basically each unique element value maps to SET of indexes.
Why use bitmaps for storing SET of indexes?
Sets are fundamental data structures in computer science. Basically set of integers can be stored as array containing elements [1,13,36,45,72,80]. However each integer can take 32 bits, hence total set above can take 6*32=192 bits. If there are huge sets, that can easily take so much space.
Instead we can store just 80 bits and set bits at locations 1,13,36,45,72,80.
Why we need compressed indexes?
Bitmap dosent work well when array element values are sparse. For eg: [ 1,12,23, 500,510,522, 10000]. In this case, if we stored as regular set, it would take 7*32 bits. Instead, if we store as bitmap, it would require atleast 10k bits. This defeats purpose of bitmaps to save space.
So various compression techniques on bitmaps can be used. Most common kind is Run-length-encoding. RLE takes input like 0000000000000000000011111111110000000000
and converts to 20(0),10(1),10(0). It counts repetitions and stores along with character. Now think of binary input, it only contains 0 and 1. Hence lot of repetitions and hence can save lot of space.
What is roaring bitmap?
Example of inverted index is an index which stores word to set of document ids(docs containing that word). And to save space on storing set of docs, we use bitmaps. So it would look like- Pot: 000010, Pan:111010. Also we store these bitmaps in some compressed RLE format to save space further.
Now if someone wants to search for document containing both words Pot and pan, we first need to decompress entire bitmaps and then we can just to AND operation and get 000010. There is problem here, doing entire decompression which is very expensive operation and we can easily run out of memory space too.
There are many RLE based compression algorithms developed like WAH, EWAH, COMPAX. But all of them lack fast AND, OR, XOR and other operations required to implement fast index operations. This is where roaring bitmaps shines and differ. Roaring bitmaps dosent perform optimal compression, however in most cases it does fairly good job, however in exchange it performs index operations blazingly fast. Also it only decompresses only parts which are required.
How roaring bitmaps works?
Roaring bitmaps breaks up all integers in buckets of 2¹⁶ integers(65535). These buckets are called container. It stores individual container data in 3 different formats based on heuristics. Hence its hybrid compression.
Three different type of storage containers used:
Array container
Store as simple sorted array of integers, no bitmaps used at all, [1,13,45,100]. However it will not store common prefix, hence saving space. Searches are done through binary search. This sorted array grows dynamically based on heuristics. If number of elements in array container exceeds 4096, then its converted to bitmap container.
Bitmap container
Stored as 1024 * Word of 64bit , no compression on bitmap. Its fixed size on allocation and does not dynamically grow. It uses 64 bit instructions on processor, hence fast.
Run container
It is made of a packed array of pairs of 16-bit integers. The first value of each pair represents a starting value, whereas the second value is the length of a run. For example, we would store the values 11, 12, 13, 14, 15 as the pair 11, 4 where 4 means that beyond 11 itself, there are 4 contiguous values that follow.
Since 2¹⁶ integers are stored in containers, when index operations are performed, it only need to do against 2 containers at a time. Hence never running out of space.
References:
- Original paper : https://arxiv.org/pdf/1603.06549.pdf
- http://roaringbitmap.org/talks/