DGIM Algorithm

Madhura Joshi
Fnplus Club
Published in
5 min readJun 4, 2019
Making notes is my favourite thing to do😋

I wanted to write a blog from the past few days. I did have some ideas but did not know on what exactly should I start with. Recently, Bhavani Ravi published a blog ( How to get better at blogging) and Ali Spittel published a blog (tips for problem solving).

Answering to Bhavani’s question which is:

What are you currently learning?

I am currently studying Big data analytics. Learning and understanding the concept of Mining data streams and various algorithms used.

And following what Ali spittle told:

Make your own tutorial of what you learn and break down the problems for easier understanding.

So here I am writing a blog of what I learned and the way I learned. Hoping that you would understand this concept as well. Before I introduce you to the algorithm, let’s get familiarized with basic concepts.

MINING DATA STREAMS

Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records.

A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities.

TYPES OF QUERIES

Ad-Hoc query- You ask a query and there is an immediate response. E.g: What is the maximum value seen so far in the stream S?

Standing queries- You are asking a query to the system say “ Anytime you have an answer to this query send me the response” , here you don’t get the answer immediately .

Now let us suppose we have a window of length N (say N=24) on a binary system, We want at all times to be able to answer a query of the form “ How many 1’s are there in the last K bits?” for K<=N.

Here comes the DGIM Algorithm into picture:

Tadaaaa!

COUNTING THE NUMBER OF 1’s IN THE DATA STREAM

DGIM algorithm (Datar-Gionis-Indyk-Motwani Algorithm)

Designed to find the number 1’s in a data set. This algorithm uses O(log²N) bits to represent a window of N bit, allows to estimate the number of 1’s in the window with and error of no more than 50%.

So this algorithm gives a 50% precise answer.

In DGIM algorithm, each bit that arrives has a timestamp, for the position at which it arrives. if the first bit has a timestamp 1, the second bit has a timestamp 2 and so on.. the positions are recognized with the window size N (the window sizes are usually taken as a multiple of 2).The windows are divided into buckets consisting of 1’s and 0's.

RULES FOR FORMING THE BUCKETS:

  1. The right side of the bucket should always start with 1. (if it starts with a 0,it is to be neglected) E.g. · 1001011 → a bucket of size 4 ,having four 1’s and starting with 1 on it’s right end.
  2. Every bucket should have at least one 1, else no bucket can be formed.
  3. All buckets should be in powers of 2.
  4. The buckets cannot decrease in size as we move to the left. (move in increasing order towards left)

Let us take an example to understand the algorithm.

Estimating the number of 1’s and counting the buckets in the given data stream.

This picture shows how we can form the buckets based on the number of ones by following the rules.

In the given data stream let us assume the new bit arrives from the right. When the new bit = 0

After the new bit ( 0 ) arrives with a time stamp 101, there is no change in the buckets.

But what if the new bit that arrives is 1, then we need to make changes..

· Create a new bucket with the current timestamp and size 1.

· If there was only one bucket of size 1, then nothing more needs to be done. However, if there are now three buckets of size 1( buckets with timestamp 100,102, 103 in the second step in the picture) We fix the problem by combining the leftmost(earliest) two buckets of size 1. (purple box)

To combine any two adjacent buckets of the same size, replace them by one bucket of twice the size. The timestamp of the new bucket is the timestamp of the rightmost of the two buckets.

Now, sometimes combining two buckets of size 1 may create a third bucket of size 2. If so, we combine the leftmost two buckets of size 2 into a bucket of size 4. This process may ripple through the bucket sizes.

How long can you continue doing this…

You can continue if current timestamp- leftmost bucket timestamp of window < N (=24 here) E.g. 103–87=16 < 24 so I continue, if it greater or equal to then I stop.

Finally the answer to the query.

How many 1’s are there in the last 20 bits?

Counting the sizes of the buckets in the last 20 bits, we say, there are 11 ones.

Ending with a motivational quote…

If you believe you can then you’re half way there.

--

--