Hyper … Log … Log … A Simple Estimation of The Number of Unique Elements in a Large Data Set
We are increasingly moving towards large datasets, where we often need to build hash tables from the data elements that we have. But how do we count the number of unique data elements? Well, first we have to parse our input data into data elements. As a simple example — in Python — we can take a string and split it into words:
data1 = str1.split(" ")
Once we have these words, we can create a set from them and then count the number of elements (the cardinality) of the set:
s1 = set(data1)
print("Actual cardinality is", len(s1))
print("Data set: ", s1)
The creation of the set can be a computationally intensive operation, so a method known as HyperLogLog [1] was developed, which can quickly estimate the number of unique elements.
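To give a feel for how HyperLogLog arrives at its estimate, here is a minimal, illustrative sketch of the core idea — this is not the datasketch implementation, and the register count, bias constant and small-range correction used here are simply the textbook choices: hash each element, use the first p bits of the hash to pick a register, and record the longest run of leading zeros seen in the remaining bits; the harmonic mean of the registers then gives the cardinality estimate.

```python
import hashlib
import math

p = 4                       # 2^p = 16 registers
m = 1 << p
registers = [0] * m

def add(item: str) -> None:
    # Take a 32-bit hash of the item
    h = int(hashlib.sha256(item.encode()).hexdigest(), 16) & ((1 << 32) - 1)
    idx = h >> (32 - p)               # first p bits choose the register
    rest = h & ((1 << (32 - p)) - 1)  # remaining 28 bits
    # rank = position of the leftmost 1-bit in the remaining bits
    rank = (32 - p) - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rank)

def estimate() -> float:
    alpha = 0.673                     # bias-correction constant for m = 16
    e = alpha * m * m / sum(2.0 ** -r for r in registers)
    v = registers.count(0)
    if e <= 2.5 * m and v > 0:        # small-range correction (linear counting)
        e = m * math.log(m / v)
    return e

for word in "one two three four two one".split():
    add(word)
print("Estimated cardinality is", estimate())
```

Note that the sketch only ever stores m small integers, no matter how many elements are added — this is what makes the method so memory-efficient.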
In this case, we will use the HyperLogLog method and compare it with the creation of a set in Python [here]:
import sys
from datasketch import HyperLogLog

str1 = "One two one two one two one two one two four"
if (len(sys.argv) > 1):
    str1 = str(sys.argv[1])