# The streaming model, and how to estimate the most frequent elements with the Misra-Gries algorithm.

Hey everyone! I’m taking my computer science master degree, and currently studying some algorithms that are not very spoken about in the mainstream literature (aka. medium articles and other blog posts). If you would like to learn about algorithms that are not so much spoken about but don’t have the patience to dive into thick research papers, this article is for you.

This post is about the **Mesa-Gries** algorithm. It is a rather simple algorithm, that is used in the **streaming model **to count the so-called **Heavy hitters. Heavy hitters** refer to elements from the stream that occurs **more frequently** than others.

**Let me first define the streaming model**: In this model, we have a sequence of items (I’ll use the letter *m *for the number of* *items) that are streamed to a computing unit, where the computer has** no information** about the data coming in.

One interesting property of a stream of items is the **number of distinct elements** in it. You might think *“That’s easy to compute, we can just keep track of all the items coming from the stream, and then count how much distinct element there is!”, *but such an approach would require **storing all of the stream items**. We could also just keep a counter for every item of the stream, but in the case where every item is distinct, it would end up being the same problem as storing all elements.

In fact, in the streaming model, we say that the number of item ** m** in the stream is far greater than what a computer can store. It’s a bit like if you would look at the river with objects floating on it: You can look at them go by, maybe grab one or two, but you probably don’t have enough hands to grab all of them.

Then how does one count all the distinct elements, with a little memory space available? It is possible to **estimate** which elements are **heavy-hitters**, with the **Misra-Gries** algorithm. With this algorithm, one can estimate the frequency of items, and even better, **find all elements that occur more than m/k times in the stream, for a defined k.** For instance in the array [1,1,2,3,4,5,1,1,1,5,3,3,1,1,2] with

**, we want to find element that occurred more than**

*k=3*

*15/3 = 5**times.*

Here are some other notations: we use a stream ** S**, of size

**, which contain a list of elements**

*m***coming from a universe**

*aₓ*

*[n] = {1,2,…,n}**(Each element aₓ*

*can have a value taken from the set {1,2,…n}).*

A **one-pass **algorithm counting all elements would require ** Ω(m)** space in the computer, which is not what we want. But by estimating the frequency of an element with the Mesa-Gries algorithm, the space usage can be bounded by

**O(**. The estimation done by this algorithm allows for a one-sided error. This means that even if all the most frequent items are in the output at the end, there could also be some

*k*( log(m)+log(n)))***infrequent elements**.

## The algorithm

The algorithm itself is quite simple. Suppose that we have a stream ** S** with

**elements coming from it. We use a dictionary**

*m***that will hold counters for the streamed items.**

*D***Initialize a dictionary DFor each item a coming from the stream S:▹ If a is in D, then D[a]++▹ Else if the size of D is less than k-1, insert a in D with value 1▹ Else decrement all counters in D by 1, and remove all elements with count 0**

## Why does it work

Let **f̂ **be the estimate of the frequency of a certain item, and

**f**the real frequency of this item in the stream

**. We can claim that**

*S*

*f - m/k**≤*

*f̂**≤*

**That is, the estimated frequency**

*f.***of**

*f̂***the item is greater or equal to the real frequency**

**of the item, minus**

*f*

*m/k.*Suppose ** e** is an element with a frequency greater than

**m**.

*/k*When an occurrence of the element ** e** is not counted, it can be for two reasons:

- Because we try to insert it in
**D**, but there is already**k-1**items in it - Or because another insertion happens when
is already full, that decrement the count of*D*.*e*

Since ** |D| = k-1**, there can be at most

**m**occurrence for such an event.

*/k*Therefore if ** e** is still in

**D**by the end of the algorithm, it occurred more than

**m**times.

*/k*## Algorithm space bounds

There is at most ** k-1** counters in

**(that we can simplify to**

*D***). For each counter, we hold a key that can be from**

*k***to**

*1***, and a corresponding value that can be from**

*n***to**

*1***.**

*m*Storing a key ** n** require

**space (think of binary representations), and a counter**

*log(n)***requires**

*m***space. So one key-value pair represent**

*log(m)***space.**

*log(n)+log(m)*Since we have k-1 keys, we end up with a higher bound **O( k*( log(m)+log(n)))** for the algorithm space usage.

## Limitations

The Misra-Gries algorithm is a simple algorithm that works just fine to estimate heavy-hitters, but as I mentioned above, it can also output infrequent items. It is possible to modify the algorithm, or even use different algorithms that can estimate the frequency of heavy-hitters, without outputting infrequent items.

If you are interested in learning more, I recommend reading about streaming algorithms and data-structures — such as Bloom filters and Count-min sketches.

Please let me know if you have any questions or comments about this article! Cheers, Stéphane