Understanding Count-Min Sketch
In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of over counting some events due to collisions.
In an ideal case for retrieving frequency of any streaming data we use HashTable as we can Store the HashValues in the Hash table and retrieve them easily at O(1). But by doing so we are storing all the data in the hash tables which will fall in to super liner Memory usage for very large (infinite) streaming of data.
To tackle this memory in efficiency model , we can use count min sketch to calculate the frequency in sub-liner space , because in this case we will not be storing the complete values of data stream , instead we will use a matrix to compute the frequency, where number of rows would be number of Hash functions we are using and columns would be number of out come of the HashFunctions.
Lets understand the count min sketch algorithm with an example
Lets say we have a stream of data
Stream = {A,A,B,A,B,D,A……..}
Lets define 4 hash functions H1,H2,H3,H4 and lets assume the below table for their out puts as shown in pic..
Now Lets Create a Matrix to keep a track of count of input streams
Here is the matrix of Hash function X possible Outputs (Matrix-1)
Now for each data from stream now lets calculate the Hash outputs and increment the corresponding counter in the table….
H1(A) = 1, H2(A) = 3, H3(A) = 1, H4(A)=2
So by increasing the corresponding counts in the matrix we will have .. matrix 2
Similarly next in stream we again get A so lets update the table .. so now the table will be matrix 3
Next in the stream we have B .. So the Hash output of B is H1(B)= 3 ,H2(B)= 5 , H3(B) = 3 , H4(B) =1 … so lets update matrix accordingly matrix 4
Next again A so updated matrix is matrix 5
Next in Stream is B so will be matrix 6
Now Next in Stream is D
H1(D)= 2 ,H2(D)= 1 , H3(D) = 4 , H4(D) =4
So updated matrix will be….matrix 7
Now again A…. so updated matrix is matrix 8
(Below Figure shows the Matrix at each stage…)
Now lets calculate the frequency of A…
Again pass A to all hash functions and result is H1(A) = 1, H2(A) = 3, H3(A) = 1, H4(A)=2
Now take the array of these positions in matrix which comes to {4,4,4,4} .. so minimum of this comes to 4 so The frequency of A= 4.
Similarly Lets calculate frequency of B H1(B)= 3 ,H2(B)= 5 , H3(B) = 3 , H4(B) =1, So the frequency = min { 2,2,2,2} = 2
In some cases due to hash collision we might get the frequency little more than what is expected to come, hence it guarantees to give the exact frequency or more. The accuracy will depend upon how unique the hash functions return the values… and also the more the hash function the more accurate will be the frequency.
In this way Count min sketch allows to calculate frequency of large data streams in sub linear space using same O(1) constant time complexity.
Count-Min Sketches are incredibly efficient. A 1000x8 Count-Min Sketch (that is, 8 hash functions that each map to a 1000-length array), would need to store eight-thousand 32-bit integers. That’s a total of 32 KB — literally orders of magnitudes smaller than the massive 32 GB Hash Table we were contemplating earlier!
Applications
- Compressed Sensing
- Networking
- Databases
- NLP, Security, Machine Learning..