The count-distinct problem solved by CouchDB
Counting the distinct numbers of things in a collection is a memory intensive operation, but if you can accept an approximate answer there is a solution.
In 2017 I blogged about creating custom indexes outside of CouchDB for problems that didn’t fit CouchDB’s indexing engine. One of those was the count distinct problem.
Imagine you are recording web events as people (and bots) interact with your website — events that look like this in JSON documents in a CouchDB database:
{
"_id": "96f898f0f6ff4a9baac4503992f31b01",
"_rev": "1-ff7b85665c4c297838963c80ecf481a3",
"path": "/blog/post-1.html",
"date": "2018-12-04",
"time": "17:15:59",
"mobile": true,
"browser": "Chrome",
"ip": "85.25.222.52"
}
We want to report on the number of distinct IP addresses that have visited our site but as there are around four billion possible IP addresses, a count distinct operation can use vast amounts of memory. At the time, I proposed using Redis’s HyperLogLog index instead — this achieves an approximate count using only a fixed handful of kilobytes of memory.
As of CouchDB 2.2, a HyperLogLog-based reducer is available to allow approximate count distinct operations to be performed in MapReduce indexes. Here’s how it’s done.
The Map function
We are going to create a MapReduce index whose key is the thing we want to count (the ip address). A JavaScript function is defined as the “map” part of the MapReduce to emit the required key:
function(doc) {
emit(doc.ip)
}
This creates an index ordered by ip address — one item per document:
1.142.99.0
77.152.2.245
85.25.222.52
The reducer
The new _approx_count_distinct
reducer reduces this index to an approximate count of the distinct keys in the index:
{"rows":[
{"key":null,"value":50133}
]}
So across the data set there are approximately 50133 distinct ip addresses (with a relative error of 2%).
At query time, you may also examine ranges of keys. Let’s say you wanted to examine traffic from your own local 192.168.* network
#?startkey="192.168.0.0"&endkey="192.169.0.0"
{"rows":[
{"key":null,"value":99}
]}
Using the _approx_count_distinct in the dashboard
The MapReduce definition can be created in the CouchDB dashboard:
Add your map function and choose the _approx_count_distinct
reducer or enter _approx_count_distinct
in the CUSTOM text box.
Conclusion
The _approx_count_distinct
reducer has one job: to provide an estimate of distinct counts (with an accuracy of 2%) much more efficiently than getting a precise answer. It is used in the same way as the other built-in reducers _sum
, _count
and _stats
but remember that it acts on the index's key, not the value.