The count-distinct problem solved by CouchDB
Counting the distinct numbers of things in a collection is a memory intensive operation, but if you can accept an approximate answer there is a solution.
In 2017 I blogged about creating custom indexes outside of CouchDB for problems that didn’t fit CouchDB’s indexing engine. One of those was the count distinct problem.
Imagine you are recording web events as people (and bots) interact with your website — events that look like this in JSON documents in a CouchDB database:
We want to report on the number of distinct IP addresses that have visited our site but as there are around four billion possible IP addresses, a count distinct operation can use vast amounts of memory. At the time, I proposed using Redis’s HyperLogLog index instead — this achieves an approximate count using only a fixed handful of kilobytes of memory.
As of CouchDB 2.2, a HyperLogLog-based reducer is available to allow approximate count distinct operations to be performed in MapReduce indexes. Here’s how it’s done.
The Map function
This creates an index ordered by ip address — one item per document:
_approx_count_distinct reducer reduces this index to an approximate count of the distinct keys in the index:
So across the data set there are approximately 50133 distinct ip addresses (with a relative error of 2%).
At query time, you may also examine ranges of keys. Let’s say you wanted to examine traffic from your own local 192.168.* network
Using the _approx_count_distinct in the dashboard
The MapReduce definition can be created in the CouchDB dashboard:
Add your map function and choose the
_approx_count_distinct reducer or enter
_approx_count_distinct in the CUSTOM text box.
_approx_count_distinct reducer has one job: to provide an estimate of distinct counts (with an accuracy of 2%) much more efficiently than getting a precise answer. It is used in the same way as the other built-in reducers
_stats but remember that it acts on the index's key, not the value.