Progressive index
Enable random accessibility for a stream of asymmetric data points
When we store multiple data points together in one stream, we need to make sure that the stream is deconstruct-able.
The trivial case is symmetric data points. What do I mean by symmetric? I mean that the size (in bytes) of each data point is equal. If we have a stream of 100 bytes and every data point is 10 bytes long, it is trivial to deconstruct the 100 bytes into ten 10-byte data points.
Storing asymmetric data points (each data point has a random size) in one stream is a bit more involved. There are multiple techniques we could use:
- Putting a distinguishable separator between data points. For example, words in an English sentence are separated by spaces or other punctuation marks. An English word cannot contain a space or punctuation mark, therefore they are distinguishable separators. A more technical example is the NDJSON file format. With NDJSON the data points are valid minified JSON objects and the separator is a \n (newline character). The newline character is a distinguishable separator, because a minified JSON object does not contain a newline character.
- The distinguishable separator technique is very common and natural to us humans, but it has a few flaws. First of all, a distinguishable separator can be hard to identify, especially if your data is a random binary stream. Another problem is the efficiency of breaking up the stream into separate data points. In the case of the distinguishable separator, we have to linearly scan through the stream to identify the end of one data point and the beginning of the next. It is possible to speed up the task by building character bit vectors with the help of SIMD instructions (see the concepts behind simdjson), but it is still a complex and tedious process. A technique which tackles both flaws mentioned before is a length prefix. With this technique, we store the length of the data point before we store the data point itself; this way we know that the data point to come is X bytes long. If we need to know how many data points are stored in the stream, we can read the first length prefix, then skip the number of bytes we just read to the next length prefix, and so on (a short sketch follows this list). This way we can jump over the data itself and do not have to ingest it all. This is more efficient than searching for a distinguishable separator, especially if the data points are large.
- The length prefix technique sounds like a perfect plan, but it has issues of its own. First of all, how are we to identify where the length value of the data point ends and the data of the data point starts? For this, there are two solutions. We store the length values in a fixed number of bytes (e.g. 4 bytes per value, which lets us represent data points of up to 4GB), or we represent the length value as a VLQ. When a number is represented as a VLQ, the 8th bit of every byte we read marks whether we need to continue reading the next byte, or whether we have already read all the bytes which represent the value. If you think about it, VLQ is in itself a distinguishable separator technique, and X bytes per value is a fixed size / symmetric value solution. A bigger issue with the length prefix is the fact that even if we can linearly skip through data points, we can’t do random value access. Random value access means that we can access the Xth data point in a stream without complex computations and ingestion of all the prior data points. In order to enable such a capability, we need to store an index. The index by itself is a stream of data points, which we can use to efficiently read a single random data point out of the data stream. But if the index is a stream of data points, how do we store these data points in a deconstruct-able way? The problem becomes recursive. The most common solution for storing the index is to store the values as symmetrical data points…
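To make the length prefix idea concrete, here is a minimal sketch that counts the data points in a stream by skipping from one length prefix to the next. It assumes 4-byte little-endian length prefixes; the function name and layout are illustrative, not part of any particular format:
// Count length-prefixed data points in a byte buffer by skipping
// from one 4-byte little-endian length prefix to the next.
func countDataPoints(in stream: [UInt8]) -> Int {
    var offset = 0
    var count = 0
    while offset + 4 <= stream.count {
        // Read the 4-byte little-endian length prefix.
        let length = stream[offset..<offset + 4]
            .enumerated()
            .reduce(0) { $0 | (Int($1.element) << (8 * $1.offset)) }
        // Skip over the prefix and the data point itself.
        offset += 4 + length
        count += 1
    }
    return count
}
Note how the loop never touches the bytes of the data points themselves; it only reads the prefixes.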
Storing index values as symmetrical data points means that we have to agree on the data point representation and its length.
For the data point representation I would recommend a single unsigned integer value, which stores the size of the data stream after the data point was added.
Let me provide an example. Say we have an array of strings:
["Hello", "my", "name", "is", "Maxim"]
And we want to store it as a continuous stream:
HellomynameisMaxim
In order to deconstruct the stream and make the data points randomly accessible, we create the following index stream:
[5, 7, 11, 13, 18]
We can read random values from the data stream based on a simple computation:
Say we need to read a data point at index n in the data stream. We look up the value at position n-1 in the index stream (or use 0 if n == 0) and the value at position n in the index stream. Given those two values, we have the range in which the nth element in the data stream is located.
So in our previous example, the first value in the data stream is in range 0..<5, the second value is in range 5..<7, the third in 7..<11, the fourth in 11..<13 and the fifth in 13..<18.
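As a minimal sketch of this lookup, assuming the data and the index are plain in-memory arrays (the function and variable names here are my own illustration):
// Read the n-th data point out of a flat byte stream using an index
// that stores the stream size after each data point was added.
func dataPoint(at n: Int, data: [UInt8], index: [Int]) -> [UInt8] {
    let start = n == 0 ? 0 : index[n - 1]
    let end = index[n]
    return Array(data[start..<end])
}

let data = Array("HellomynameisMaxim".utf8)
let index = [5, 7, 11, 13, 18]
// dataPoint(at: 4, data: data, index: index) yields the bytes of "Maxim"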
Now we need to agree on the size of the value, or how many bytes we would like to spend on every index data point. Remember, in order to make the index randomly accessible, its data points have to be symmetrical.
We know that the index values are monotonically increasing positive numbers, but we probably can’t tell how large the last/biggest number will be. So the easiest solution is to be defensive and represent all numbers in large chunks (e.g. 4 bytes per number). This means we will store small numbers in larger chunks, similar to how Amazon puts small items into bigger boxes filled with packing material just to simplify the transportation complexity.
Another option is to keep the index values in large chunks in runtime memory and when we add the last data point to the data stream, repackage the index stream values into smaller chunks. This solution is a bit more space efficient, but it is computationally heavier.
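A minimal sketch of that repackaging step, assuming we keep the in-memory index as 8-byte integers and only decide on the final width once the last (largest) value is known; the names are illustrative:
// Keep index values as 8-byte integers in memory, then repackage them
// into the smallest fixed width that still fits the largest (last) value.
func repackage(_ index: [UInt64]) -> (bytesPerValue: Int, packed: [UInt8]) {
    let largest = index.last ?? 0
    // How many bytes does the largest value need?
    var width = 1
    while width < 8 && largest >= (UInt64(1) << (8 * width)) { width += 1 }
    // Write every value little-endian with that fixed width.
    var packed: [UInt8] = []
    for value in index {
        for byte in 0..<width {
            packed.append(UInt8(truncatingIfNeeded: value >> (8 * byte)))
        }
    }
    return (width, packed)
}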
This is where I would like to finally introduce my idea:
Progressive index for asymmetric data
The idea is based on the fact that the values in the index stream are monotonically increasing positive numbers. Instead of putting the numbers in one stream and padding the small numbers, we can group the numbers into multiple streams based on the value size. This is what progressive means: the first items might fit in 1 byte, then 2 bytes, then 3 bytes, etc.
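For a sense of what that grouping looks like, here is a tiny sketch; the helper name and the sample values are made up for illustration:
// Group monotonically increasing index values by the number of bytes
// needed to represent each value.
func bytesNeeded(_ value: Int) -> Int {
    var width = 1
    while width < 8 && value >= (1 << (8 * width)) { width += 1 }
    return width
}

let indexValues = [5, 7, 11, 200, 300, 70_000, 70_500]
let grouped = Dictionary(grouping: indexValues, by: bytesNeeded)
// grouped[1] == [5, 7, 11, 200], grouped[2] == [300], grouped[3] == [70_000, 70_500]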
How do we build up multiple streams though?
There are multiple solutions, based on different use cases. I will present just one in this blog post.
Let’s say we develop some kind of a time series data persistence layer. The data itself we append to a xxx.data file. After the append, we check whether the new size of the xxx.data file is representable in 1 byte; if so, we append the number to a xxx.index1 file, which stores all index values representable in 1 byte. If that is not the case, we append the index value to xxx.index2, xxx.index3, xxx.index4, etc.
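Here is a minimal sketch of that append path, with in-memory buffers standing in for the xxx.data and xxx.indexN files; the ProgressiveIndexWriter type and all its details are my own illustration of the idea, not a finished design:
// Append a data point, then record the new total size in the narrowest
// index stream that can still represent it.
struct ProgressiveIndexWriter {
    var data: [UInt8] = []                 // stands in for xxx.data
    var indexStreams: [Int: [UInt8]] = [:] // byte width -> stands in for xxx.index<width>

    mutating func append(_ dataPoint: [UInt8]) {
        data.append(contentsOf: dataPoint)
        let size = data.count
        // How many bytes are needed to represent the new total size?
        var width = 1
        while width < 8 && size >= (1 << (8 * width)) { width += 1 }
        // Append the size, little-endian, to the matching index stream.
        for byte in 0..<width {
            indexStreams[width, default: []].append(UInt8(truncatingIfNeeded: size >> (8 * byte)))
        }
    }
}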
On a random value read, we have a bit of a computation to do in order to identify which index file contains the data for element n. We need to compute how many elements are represented by each file. This is quite simple: we take the size of the index file and divide it by the value size:
count1 = sizeOf("xxx.index1") / 1
count2 = sizeOf("xxx.index2") / 2
count3 = sizeOf("xxx.index3") / 3
// etc ...
When we have the counts, we can easily identify which index file contains the values for n-1 and n.
// Given the per-file counts, find which index file holds the value at
// global position n and the local position inside that file.
func locate(_ n: Int, counts: [Int]) -> (fileIndex: Int, localIndex: Int)? {
    var localIndex = n
    for (fileIndex, count) in counts.enumerated() {
        if localIndex < count {
            return (fileIndex, localIndex)
        }
        localIndex -= count
    }
    return nil // n is out of range
}
And now we just need to read those two values and fetch the bytes from the xxx.data file in the range they define: from the value for n-1 (inclusive) up to the value for n (exclusive).
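To make the read path concrete, here is a sketch that builds on the hypothetical ProgressiveIndexWriter from above: it locates the right index stream for a position, decodes the stored value, and then slices the data buffer:
extension ProgressiveIndexWriter {
    // Decode the stored "stream size" value at global position n.
    func indexValue(at n: Int) -> Int {
        var localIndex = n
        // Visit widths in ascending order: the smallest values come first.
        for width in indexStreams.keys.sorted() {
            let stream = indexStreams[width]!
            let count = stream.count / width
            if localIndex < count {
                let start = localIndex * width
                return stream[start..<start + width]
                    .enumerated()
                    .reduce(0) { $0 | (Int($1.element) << (8 * $1.offset)) }
            }
            localIndex -= count
        }
        fatalError("index out of range")
    }

    // Read the n-th data point using the two surrounding index values.
    func dataPoint(at n: Int) -> [UInt8] {
        let start = n == 0 ? 0 : indexValue(at: n - 1)
        let end = indexValue(at: n)
        return Array(data[start..<end])
    }
}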
Obviously, a read operation on a progressive index is more involved than a simple index lookup, but the computation is not very expensive and it comes with the benefit of space efficiency.
Thank you for reading my post. Please let me know if you have questions, or if you end up using progressive indexing in your project.