Probabilistic data structures

François Hoehl
Published in Algodeck
Aug 21, 2017

We are familiar with data structures that return correct answers. What if we allowed a few errors in order to gain speed or reduce memory usage? Maybe we are fine with being wrong one time out of 100.

Probabilistic data structures trade exact answers for approximate ones in exchange for lower memory usage, constant query time, and better scaling.

There are a few important probabilistic data structures that excel at solving two kinds of problems: the cardinality problem (counting the distinct elements in a dataset) and the membership problem (checking for the presence of an element in a dataset).

Some share an interesting feature: you can fix the memory size up front. You know in advance that it won't grow.

Reading list

Bloom filter

Memory- and speed-efficient at testing membership.

https://bdupras.github.io/filter-tutorial/
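Here is a minimal Python sketch of the idea, assuming k hash positions derived by salting SHA-256 with an index (a real library would use faster hashes and a packed bit array):

```python
import hashlib

class BloomFilter:
    """Bit array + k hash positions: membership tests can give false
    positives, but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # fixed size, never grows

    def _positions(self, item):
        # Derive k positions by salting a single hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print("alice" in bf)  # True
print("bob" in bf)    # almost certainly False
```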

Cuckoo filter

Even better than a Bloom filter: it uses less memory and, unlike a plain Bloom filter, also supports deleting elements.

http://blog.fastforwardlabs.com/2016/11/23/probabilistic-data-structure-showdown-cuckoo.html
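A toy sketch of the mechanism, assuming 8-bit fingerprints, a power-of-two bucket count (so the XOR trick for the alternate bucket works) and SHA-256 for hashing; a production implementation would tune all of these:

```python
import hashlib
import random

class CuckooFilter:
    """Stores short fingerprints; each item has two candidate buckets,
    and insertion may evict ("kick") an existing fingerprint."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        self.num_buckets = num_buckets      # must be a power of two
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash(self, data):
        return int(hashlib.sha256(data.encode()).hexdigest(), 16)

    def _fingerprint(self, item):
        return self._hash(item) & 0xFF      # 8-bit fingerprint

    def _indexes(self, item, fp):
        i1 = self._hash(item) % self.num_buckets
        i2 = (i1 ^ self._hash(str(fp))) % self.num_buckets
        return i1, i2

    def add(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indexes(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            fp, self.buckets[i][0] = self.buckets[i][0], fp
            i = (i ^ self._hash(str(fp))) % self.num_buckets
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False                         # filter is considered full

    def __contains__(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indexes(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def remove(self, item):
        # Deletion works because the fingerprint is stored explicitly.
        fp = self._fingerprint(item)
        i1, i2 = self._indexes(item, fp)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```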

HyperLogLog

This one was created to count the number of distinct elements in a dataset.

http://antirez.com/news/75
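A rough sketch of the algorithm, assuming SHA-256 as the hash and skipping the small- and large-range corrections a real implementation applies. The memory is a fixed array of registers, no matter how many elements you feed it:

```python
import hashlib

class HyperLogLog:
    """Approximates the number of distinct elements with a fixed
    array of registers: memory never grows with the dataset."""

    def __init__(self, precision=10):
        self.p = precision                   # 2^p registers
        self.m = 1 << precision
        self.registers = [0] * self.m
        # Bias-correction constant (standard approximation for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        x = hashlib.sha256(str(item).encode()).hexdigest()
        x = int(x, 16)
        idx = x & (self.m - 1)               # first p bits pick a register
        rest = x >> self.p                   # remaining bits
        # Rank = position of the lowest 1-bit (geometrically distributed).
        rank = 1
        while rest & 1 == 0 and rank <= 64:
            rank += 1
            rest >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic mean of 2^register values, scaled by alpha * m^2.
        estimate = self.alpha * self.m ** 2 / sum(2 ** -r for r in self.registers)
        return int(estimate)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(hll.count())  # roughly 100,000, within a few percent
```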
