# The streaming model, and how to estimate the most frequent elements with the Misra-Gries algorithm.

Apr 22 · 5 min read

Hey everyone! I’m taking my computer science master degree, and currently studying some algorithms that are not very spoken about in the mainstream literature (aka. medium articles and other blog posts). If you would like to learn about algorithms that are not so much spoken about but don’t have the patience to dive into thick research papers, this article is for you.

This post is about the Mesa-Gries algorithm. It is a rather simple algorithm, that is used in the streaming model to count the so-called Heavy hitters. Heavy hitters refer to elements from the stream that occurs more frequently than others.

Let me first define the streaming model: In this model, we have a sequence of items (I’ll use the letter m for the number of items) that are streamed to a computing unit, where the computer has no information about the data coming in.

One interesting property of a stream of items is the number of distinct elements in it. You might think “That’s easy to compute, we can just keep track of all the items coming from the stream, and then count how much distinct element there is!”, but such an approach would require storing all of the stream items. We could also just keep a counter for every item of the stream, but in the case where every item is distinct, it would end up being the same problem as storing all elements.

In fact, in the streaming model, we say that the number of item m in the stream is far greater than what a computer can store. It’s a bit like if you would look at the river with objects floating on it: You can look at them go by, maybe grab one or two, but you probably don’t have enough hands to grab all of them.

Then how does one count all the distinct elements, with a little memory space available? It is possible to estimate which elements are heavy-hitters, with the Misra-Gries algorithm. With this algorithm, one can estimate the frequency of items, and even better, find all elements that occur more than m/k times in the stream, for a defined k. For instance in the array [1,1,2,3,4,5,1,1,1,5,3,3,1,1,2] with k=3, we want to find element that occurred more than 15/3 = 5 times.

Here are some other notations: we use a stream S, of size m, which contain a list of elements aₓ coming from a universe [n] = {1,2,…,n} (Each element aₓ can have a value taken from the set {1,2,…n}).

A one-pass algorithm counting all elements would require Ω(m) space in the computer, which is not what we want. But by estimating the frequency of an element with the Mesa-Gries algorithm, the space usage can be bounded by O(k*( log(m)+log(n))). The estimation done by this algorithm allows for a one-sided error. This means that even if all the most frequent items are in the output at the end, there could also be some infrequent elements.

The algorithm itself is quite simple. Suppose that we have a stream S with m elements coming from it. We use a dictionary D that will hold counters for the streamed items.

Initialize a dictionary D
For each item a coming from the stream S:
▹ If a is in D, then D[a]++
▹ Else if the size of D is less than k-1, insert a in D with value 1
▹ Else decrement all counters in D by 1, and remove all elements with count 0

Let be the estimate of the frequency of a certain item, and f the real frequency of this item in the stream S. We can claim that f - m/k f. That is, the estimated frequency of the item is greater or equal to the real frequency f of the item, minus m/k.

Suppose e is an element with a frequency greater than m/k.

When an occurrence of the element e is not counted, it can be for two reasons:

• Because we try to insert it in D, but there is already k-1 items in it
• Or because another insertion happens when D is already full, that decrement the count of e.

Since |D| = k-1, there can be at most m/k occurrence for such an event.

Therefore if e is still in D by the end of the algorithm, it occurred more than m/k times.

There is at most k-1 counters in D (that we can simplify to k). For each counter, we hold a key that can be from 1 to n, and a corresponding value that can be from 1 to m.

Storing a key n require log(n) space (think of binary representations), and a counter m requires log(m) space. So one key-value pair represent log(n)+log(m) space.

Since we have k-1 keys, we end up with a higher bound O(k*( log(m)+log(n))) for the algorithm space usage.

The Misra-Gries algorithm is a simple algorithm that works just fine to estimate heavy-hitters, but as I mentioned above, it can also output infrequent items. It is possible to modify the algorithm, or even use different algorithms that can estimate the frequency of heavy-hitters, without outputting infrequent items.

If you are interested in learning more, I recommend reading about streaming algorithms and data-structures — such as Bloom filters and Count-min sketches.

## Nerd For Tech

From Confusion to Clarification

## Nerd For Tech

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To stay up to date on other topics, follow us on LinkedIn. https://www.linkedin.com/company/nerdfortech

Written by

## Stéphane Guichard

CS MSc Student in Copenhagen - Tech & Innovation lover https://www.linkedin.com/in/steph-guichard/?locale=en_US

## Nerd For Tech

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To stay up to date on other topics, follow us on LinkedIn. https://www.linkedin.com/company/nerdfortech

## The magnificient power of data exploration

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app