Know Thy Data

Rohit Saboo
Engineering@Nauto
Jan 30, 2020

How we leverage an in-depth understanding of our sensor data to achieve lossless and efficient compression for our customers

An artistic impression of the accelerometer signal during normal driving

There’s a well-known maxim: “Understand your data.” Still, we often underestimate how much an in-depth understanding of our data can pay off. Take the IMU data coming from the accelerometer, gyroscope, and orientation sensors on the device, and pick one of these sensor streams, say the accelerometer:

  • The data is sampled at approximately 200Hz.
  • Each sample has three 32-bit floating point values, one per spatial dimension.
  • Each sample has both a hardware-acquired sensor timestamp and an Android-given system timestamp.

Here’s an example of an accelerometer sample:
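For illustration, a single sample might look like the following Kotlin sketch (the field names and values here are made up for this example, not our production schema):

    // One accelerometer sample as delivered by the Android sensor framework.
    // Field names and the sample values below are illustrative.
    data class AccelSample(
        val x: Float,                 // m/s^2
        val y: Float,                 // m/s^2
        val z: Float,                 // m/s^2 (~gravity when the device sits flat)
        val sensorTimestampNs: Long,  // hardware-acquired sensor timestamp
        val systemTimestampNs: Long   // Android-given system timestamp
    )

    val sample = AccelSample(
        x = 0.12f, y = -0.05f, z = 9.81f,
        sensorTimestampNs = 1_580_428_800_000_000_000L,
        systemTimestampNs = 1_580_428_800_000_153_000L
    )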

This data is useful for understanding collisions, aggressive driving, and many other things.

A 10-second sample of a single stream as gzip-compressed JSON takes about 66kB; at that rate, one hour of driving is ~23MB. If we save this data as gzip-compressed raw binary instead, it is about three times smaller (i.e., a 10-second sample takes ~22kB).
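For a sense of where that raw-binary baseline comes from: at 200Hz, a 10-second window is ~2,000 samples, and each sample above packs into 28 bytes (three 4-byte floats plus two 8-byte timestamps), i.e., ~56kB before gzip. A sketch of that baseline, reusing the hypothetical AccelSample from above:

    import java.io.ByteArrayOutputStream
    import java.nio.ByteBuffer
    import java.nio.ByteOrder
    import java.util.zip.GZIPOutputStream

    // Pack samples as raw little-endian binary and gzip the result.
    // A sketch of the baseline, not our production pipeline.
    fun gzipRawBinary(samples: List<AccelSample>): ByteArray {
        val buf = ByteBuffer.allocate(samples.size * 28)   // 3*4 + 2*8 bytes each
            .order(ByteOrder.LITTLE_ENDIAN)
        for (s in samples) {
            buf.putFloat(s.x).putFloat(s.y).putFloat(s.z)
            buf.putLong(s.sensorTimestampNs).putLong(s.systemTimestampNs)
        }
        val out = ByteArrayOutputStream()
        GZIPOutputStream(out).use { it.write(buf.array()) }
        return out.toByteArray()
    }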

Is there a way we can improve the compression ratio while also

  • keeping the compression lossless;
  • maintaining the sampling rate; and
  • keeping the computational overhead low?

It turns out that we can easily achieve an overall reduction of ~77%, or, in other words, compress the data to roughly a quarter of the size of the gzip-compressed raw binary! I have not computed the Weissman score for this algorithm, but I would be quite curious to know what you find out!

Let’s look at a few things we did to achieve these results.

Bits of precision

When the Android subsystem returns the sensor stream to us, we get 32-bit floating point numbers. However, what does the hardware return? In our case, the hardware accelerometer has 16 bits of precision and is limited to +/- 16g. Thus, we know that each 32-bit floating point value holds no more than 16 bits of information, and from this range we can derive a scaling factor for converting the 32-bit floating point number back to a 16-bit integer.
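As a minimal sketch of that quantization, assuming the +/- 16g range maps linearly onto a signed 16-bit integer (the exact scale factor would come from the sensor's datasheet):

    // Recover the 16-bit integer the hardware actually produced from the
    // 32-bit float Android hands us. Assumes a +/- 16g full-scale range over
    // a signed 16-bit integer; the true scale factor is in the datasheet.
    const val G = 9.80665f                // m/s^2 per g
    const val LSB = (16f * G) / 32768f    // m/s^2 per least-significant bit

    fun quantize(value: Float): Short = Math.round(value / LSB).toShort()

    // Recovers the float, assuming the framework applied the same scaling.
    fun dequantize(raw: Short): Float = raw * LSB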

The sample accelerometer values from above can now be written as:
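With the illustrative numbers from earlier and the quantize sketch above, the three axis readings collapse to small integers:

    val qx = quantize(0.12f)   // 25
    val qy = quantize(-0.05f)  // -10
    val qz = quantize(9.81f)   // 2049 (roughly 1g in 16-bit counts)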

A 10-second sample of this stream takes ~16kB, i.e., a reduction of ~27%.

How does this data change with time?

The data represents a time series captured from a real-world sensor. As a result, consecutive values do not vary wildly; each is generally only a small change from the previous one. Computing and storing the difference between consecutive values for every sample except the first would make the stream from above look like:
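For instance, (illustrative) consecutive z readings of 2049, 2050, 2050, 2048 would be stored as 2049, +1, 0, -2. A minimal sketch of the delta step over the quantized values:

    // Store the first value as-is and each later value as the change from its
    // predecessor; decoding is a running sum, so the step is lossless.
    // A delta between two 16-bit values can need 17 bits, hence IntArray here;
    // the varint encoding further below keeps the common small deltas cheap.
    fun deltaEncode(values: ShortArray): IntArray {
        val out = IntArray(values.size)
        for (i in values.indices) {
            out[i] = if (i == 0) values[0].toInt() else values[i] - values[i - 1]
        }
        return out
    }

    fun deltaDecode(deltas: IntArray): ShortArray {
        val out = ShortArray(deltas.size)
        var acc = 0
        for (i in deltas.indices) {
            acc += deltas[i]
            out[i] = acc.toShort()
        }
        return out
    }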

With this, a 10-second sample now takes ~8kB, i.e., an overall reduction of ~64%.

Column vs row layout

After we compute the differences (and in some cases even before), consecutive values are often identical. Such repetition compresses slightly better when it’s represented in a columnar format, where all the x values are stored together, followed by all the y values, and so on. A 10-second sample then takes ~7kB, i.e., an overall reduction of ~68%.
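A sketch of the regrouping (shown on the float samples for brevity; the same idea applies to the quantized deltas):

    // Row layout interleaves axes: x0 y0 z0 x1 y1 z1 ...
    // Column layout groups them:   x0 x1 ... y0 y1 ... z0 z1 ...
    // Repeated values within one axis now sit next to each other as repeated
    // byte patterns, which gzip's LZ77 matching exploits.
    fun toColumns(samples: List<AccelSample>): Triple<FloatArray, FloatArray, FloatArray> =
        Triple(
            FloatArray(samples.size) { i -> samples[i].x },
            FloatArray(samples.size) { i -> samples[i].y },
            FloatArray(samples.size) { i -> samples[i].z }
        )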

Data storage format

An artistic impression of the accelerometer signal during a collision

The differences tend to be small, but every now and then they can spike; consecutive accelerometer values can vary wildly during a collision, for example. With a fixed-width encoding, we are therefore still stuck storing 16 bits per value just to accommodate the extremes. What if there were a data encoding that could represent numbers with a variable number of bits depending on how large they are? It turns out one such scheme already exists, commonly referred to as “variable integer encoding” (varint). For full details of this encoding scheme, read more on variable-length quantity on Wikipedia.
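The deltas are signed, so it helps to first ZigZag-map them onto unsigned values (0 → 0, -1 → 1, 1 → 2, -2 → 3, ...) so that small magnitudes stay small. A minimal sketch of the mechanics:

    // ZigZag: interleave negative and positive so small-magnitude signed
    // values become small unsigned values, keeping their varints short.
    fun zigZag(n: Int): Int = (n shl 1) xor (n shr 31)

    // Varint: 7 payload bits per byte; the high bit means "more bytes follow".
    // A typical small delta now costs one byte instead of a fixed two.
    fun writeVarint(value: Int, out: MutableList<Byte>) {
        var v = zigZag(value)
        while (v and 0x7F.inv() != 0) {
            out.add(((v and 0x7F) or 0x80).toByte())
            v = v ushr 7
        }
        out.add(v.toByte())
    }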

Additionally, it turns out that protocol buffers (apart from having numerous other benefits) provide variable integer encoding out of the box, so we don’t need to implement it ourselves; their sint32/sint64 field types pair exactly this ZigZag step with varints.

Conclusion

With all this, a 10-second sample now takes ~5kB, i.e., an overall reduction of ~77%! Every optimization we’ve made is lossless, is relatively easy to implement, and has very low computational overhead. Further, this cascade of techniques can be adapted for various sensor streams.
