Probabilistic data structures
We are familiar with data structures that return correct answers. What if we allowed a few errors in exchange for speed or lower memory usage? Maybe we are fine with being wrong one time in a hundred.
Probabilistic data structures trade wrong or approximate answers for lower memory usage, constant query time, and easy scaling.
A few important probabilistic data structures excel at solving two kinds of problems: the cardinality problem (counting the distinct elements in a dataset) and the membership problem (testing whether an element is present in a dataset).
Some have an interesting property: you can fix the memory size up front, so you know in advance that it won't grow.
Reading list
Bloom filter
A memory- and speed-efficient way to test membership.
https://bdupras.github.io/filter-tutorial/
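As a rough sketch (not taken from the linked tutorial), a minimal Bloom filter can be built from a fixed bit array and k hash positions per item; the class and parameter names below are illustrative:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k positions in an m-bit array.
    Memory is fixed up front; queries may return false positives,
    but never false negatives."""

    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive k positions from one digest (the double-hashing trick).
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # True means "probably present"; False is always definitive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Adding an element can only set bits, never clear them, which is why a Bloom filter cannot delete elements and why a negative answer is always correct.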
Cuckoo filter
An alternative to the Bloom filter: it uses less memory at comparable error rates and, unlike a Bloom filter, supports deleting elements.
http://blog.fastforwardlabs.com/2016/11/23/probabilistic-data-structure-showdown-cuckoo.html
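To make the difference concrete, here is a minimal cuckoo filter sketch (my own simplification, not the linked article's code): each item is reduced to a short fingerprint stored in one of two candidate buckets, found by partial-key cuckoo hashing. All names and sizes are illustrative:

```python
import hashlib
import random

class CuckooFilter:
    """Minimal cuckoo filter sketch: 8-bit fingerprints in small buckets,
    two candidate buckets per item. Supports deletion, unlike a Bloom filter."""

    def __init__(self, n_buckets=1024, bucket_size=4, max_kicks=200):
        self.n = n_buckets              # must be a power of two
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _fingerprint(self, item):
        # One nonzero byte of the item's hash.
        return hashlib.sha256(item.encode()).digest()[0] | 1

    def _index(self, item):
        d = hashlib.sha256(item.encode()).digest()
        return int.from_bytes(d[1:5], "big") % self.n

    def _alt_index(self, idx, fp):
        # XOR with a hash of the fingerprint: alt(alt(i)) == i,
        # so an entry can move between its two buckets using only fp.
        h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:4], "big") % self.n
        return idx ^ h

    def add(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict entries, relocating each to its alternate.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False                    # table considered full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Because only the fingerprint is stored, a matching fingerprint in either bucket can be removed to delete an element, something a plain Bloom filter's shared bits make impossible.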
HyperLogLog
This one was created to estimate the number of distinct elements in a dataset (the cardinality problem).
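The core idea can be sketched in a few lines (a simplified version, with illustrative names and no bias corrections beyond the basics): hash each element, use the first p bits to pick a register, and track the longest run of leading zeros seen in the remaining bits; the registers together yield a cardinality estimate:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers of a few bits each.
    Memory is fixed regardless of how many elements are added."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        x = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)                  # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in `rest` (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic mean of register estimates, with the standard alpha factor.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p = 10 (1024 registers, roughly a kilobyte of state), the typical relative error is around 1.04 / sqrt(1024) ≈ 3%, no matter how large the dataset grows.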