
Hyper … Log … Log … A Simple Estimation of The Number of Unique Elements in a Large Data Set


We are increasingly working with large datasets, where we often need to build hash tables from the data elements that we have. But how do we count the number of data elements that are unique? First, we have to parse our input data into data elements. As a simple example in Python, we can take a string and split it into words:

data1 = str1.split(" ")

Once we have these words, we can create a set from them, and then count the number of elements (the cardinality) in the set:

s1 = set(data1)
print("Actual cardinality is", len(s1))
print("Data set: ", s1)
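As a quick worked example of the exact count, the sketch below runs the same two steps on the sample sentence used later in this article. The case-folding step is an assumption added for illustration: the split shown above keeps "One" and "one" as separate elements.

```python
str1 = "One two one two one two one two one two four"

# Exact count: every distinct word is stored in the set.
words = str1.split(" ")
exact = len(set(words))
print("Actual cardinality is", exact)

# Assumption (not in the original listing): fold case so that
# "One" and "one" are counted as the same element.
folded = len({w.lower() for w in words})
print("Case-insensitive cardinality is", folded)
```

Whether to normalise case depends on what should count as "the same" element in the data being analysed.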

Creating the set can be computationally intensive for large inputs, because every distinct element has to be hashed and stored. A method known as HyperLogLog [1] can instead quickly estimate the number of unique elements.
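To give a feel for how such an estimate can work, here is a minimal sketch of the HyperLogLog idea in plain Python. It is an illustration under stated assumptions, not the code of any particular library: the function name `hll_estimate`, the use of a truncated SHA-1 as the 64-bit hash, and the default of 2^8 registers are all choices made for this example. Each element is hashed, the first `p` bits select a register, and the register keeps the largest "position of the first 1-bit" seen in the remaining bits. The registers are then combined with a harmonic mean.

```python
import hashlib
import math

def hll_estimate(items, p=8):
    """Estimate the number of distinct strings in `items` using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item (SHA-1 truncated; an assumption for this sketch).
        h = int.from_bytes(hashlib.sha1(item.encode("utf8")).digest()[:8], "big")
        idx = h >> (64 - p)                  # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)         # bias correction, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw

words = "One two one two one two one two one two four".split(" ")
print("Estimated cardinality is", hll_estimate(words))
```

The memory used is fixed by the number of registers rather than by the number of distinct elements, which is the trade-off against building the full set. With 256 registers the expected relative error is roughly 1.04/sqrt(256), or about 6.5%.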

In this case, we will use the HyperLogLog method, and compare its estimate with the exact count from a Python set [here]:

import sys
from datasketch import HyperLogLog

str1 = "One two one two one two one two one two four"
if (len(sys.argv) > 1):
    str1 = str(sys.argv[1])
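The listing above stops after reading the input. A minimal sketch of how the comparison might continue is given below, assuming the `datasketch` package's `HyperLogLog` class with its `update()` and `count()` methods. The helper name `compare` and the import guard are additions for this example, so that the exact count still runs when the package is not installed.

```python
try:
    from datasketch import HyperLogLog  # third-party: pip install datasketch
except ImportError:
    HyperLogLog = None  # the estimate is skipped if datasketch is missing

def compare(text):
    """Return (exact, estimate) distinct-word counts for `text`."""
    data1 = text.split(" ")
    exact = len(set(data1))
    if HyperLogLog is None:
        return exact, None
    hll = HyperLogLog()                  # default precision: 2**8 registers
    for word in data1:
        hll.update(word.encode("utf8"))  # update() expects bytes
    return exact, hll.count()

exact, estimate = compare("One two one two one two one two one two four")
print("Actual cardinality is", exact)
print("Estimated cardinality is", estimate)
```

For an input this small the estimate should sit very close to the exact count. The benefit of the sketch shows up on inputs with millions of distinct elements, where the set grows with the data and the HyperLogLog registers do not.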


Prof Bill Buchanan OBE FRSE
ASecuritySite: When Bob Met Alice