Top K Frequent Items in a Time-sliding Window
TL;DR
I have demonstrated a “top K frequent items” algorithm on the Twitter stream. You can check out the project source code at: https://github.com/efekaptan/top-k
Introduction
Finding the top elements of a finite set or a live stream is a straightforward process. Things become challenging, however, when the calculation depends on a sliding time interval. Finding the “top K frequent words from the Twitter stream for the last 10 minutes” is a good example of such a problem.
Can we find an algorithm that provides fast insertion (O(1)), fast deletion (O(1)) and fast selection (O(1)) within a data stream? At first I decided to use a heap data structure for fast insertion and selection, but realised that heaps are not suitable for frequent delete operations: each deletion from a heap requires a “heapify” step with O(log n) cost.
I did some research on the topic, dug into a couple of data-stream algorithms and came up with a solution that performs all three operations efficiently. I assume that the sliding part of the data fits into memory.
Algorithm
The algorithm is rather simple: a combination of two linked lists and a hash table. Before diving into the details, let’s look at a visualisation of the algorithm:
Each count node (red) has a head pointer to a word node. Words with the same frequency form a doubly linked list, and each word node also points back to its count node. Whenever we receive a new word from the stream, we first check the hash table to find the node reference; the node then moves to the next higher count node, which gives us fast insertion.
The delete operation works in a similar way: the word arrives on the “deletion” stream and its node moves to the next lower count node.
Selecting the top K items is straightforward: start from the highest-frequency count node, take its head word node and keep moving to the next one.
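The structure above can be sketched as follows. This is a minimal illustration, not the project’s actual code (class and method names are mine): the article links word nodes into per-count doubly linked lists, while this sketch keeps a LinkedHashSet of words per count bucket for brevity, which preserves the same O(1) behaviour per operation.

```java
import java.util.*;

// Count buckets form a doubly linked list ordered by ascending count;
// a hash table maps each word to its current bucket.
class FrequencyBuckets {
    private static final class CountNode {
        final int count;
        final LinkedHashSet<String> words = new LinkedHashSet<>();
        CountNode prev, next;
        CountNode(int count) { this.count = count; }
    }

    private final CountNode head = new CountNode(0); // sentinel bucket
    private CountNode tail = head;                   // highest-count bucket
    private final Map<String, CountNode> index = new HashMap<>();

    // Insertion stream: move the word one bucket up. O(1).
    public void increment(String word) {
        CountNode cur = index.getOrDefault(word, head);
        CountNode next = cur.next;
        if (next == null || next.count != cur.count + 1) {
            next = insertAfter(cur, new CountNode(cur.count + 1));
        }
        next.words.add(word);
        index.put(word, next);
        if (cur != head) removeWord(cur, word);
    }

    // Deletion stream: move the word one bucket down. O(1).
    public void decrement(String word) {
        CountNode cur = index.get(word);
        if (cur == null) return;
        if (cur.count > 1) {
            CountNode target = cur.prev;
            if (target.count != cur.count - 1) {
                target = insertAfter(cur.prev, new CountNode(cur.count - 1));
            }
            target.words.add(word);
            index.put(word, target);
        } else {
            index.remove(word);
        }
        removeWord(cur, word);
    }

    // Selection: walk the buckets from the highest count downwards. O(k).
    public List<String> topK(int k) {
        List<String> out = new ArrayList<>();
        for (CountNode n = tail; n != head && out.size() < k; n = n.prev) {
            for (String w : n.words) {
                if (out.size() == k) break;
                out.add(w);
            }
        }
        return out;
    }

    private CountNode insertAfter(CountNode node, CountNode fresh) {
        fresh.prev = node;
        fresh.next = node.next;
        if (node.next != null) node.next.prev = fresh; else tail = fresh;
        node.next = fresh;
        return fresh;
    }

    private void removeWord(CountNode node, String word) {
        node.words.remove(word);
        if (node.words.isEmpty()) {  // unlink the now-empty bucket
            node.prev.next = node.next;
            if (node.next != null) node.next.prev = node.prev; else tail = node.prev;
        }
    }
}
```

Because a word only ever moves to an adjacent bucket, every increment and decrement touches at most two buckets, which is where the O(1) guarantee comes from.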
Implementation
I used Java for backend services and React for the frontend application. Kafka was chosen as the messaging system.
The main components of the system are:
- Tweet Exporter (which subscribes to the Twitter stream and sends tweets to Kafka)
- Term Exporter (which receives tweets from Kafka and tokenizes them into words)
- Frequency-api (which receives the words from Kafka and serves the frequency result)
- Frontend (which visualises the result)
A couple of implementation notes:
- I used the official Twitter API for the tweet stream. The stream was filtered to receive English tweets only; the system can be extended with another exporter (producer) for a different language.
- The Term Exporter does basic English tokenization on each tweet. It removes stop words, URLs, emojis, “RT”, mentions and non-alphanumeric characters.
String tweet = "RT @gorillaz: Meanwhile, at Kong Studios... \uD83E";
List<String> expected = Arrays.asList("kong", "studios");
Assert.assertEquals(expected, getTokens(tweet));
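A minimal sketch of a getTokens implementation that satisfies the assertion above (the real Term Exporter uses a fuller stop-word list; the small set and regexes here are illustrative only):

```java
import java.util.*;
import java.util.regex.Pattern;

class Tokenizer {
    // Illustrative sample; a real tokenizer would load a full stop-word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("rt", "at", "the", "a", "an", "and", "meanwhile"));
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    static List<String> getTokens(String tweet) {
        String cleaned = URL.matcher(tweet).replaceAll(" ");     // strip URLs
        cleaned = cleaned.replaceAll("@\\w+", " ");              // strip mentions
        cleaned = cleaned.replaceAll("[^\\p{Alnum} ]", " ");     // strip emojis, punctuation
        List<String> tokens = new ArrayList<>();
        for (String t : cleaned.toLowerCase(Locale.ENGLISH).split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }
}
```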
- The words are queued into two Kafka topics: one for insertion and one for deletion. Each word carries a timestamp so the 10-minute interval can be checked before deletion.
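The deletion-side check reduces to simple clock arithmetic. A sketch, assuming each word record carries its original ingestion timestamp in epoch milliseconds (names are illustrative):

```java
class EvictionClock {
    static final long WINDOW_MS = 10 * 60 * 1000;   // the 10-minute sliding window

    // Milliseconds the deletion consumer should still wait before decrementing
    // this word; zero or negative means the word has left the window.
    static long remainingMs(long wordTimestampMs, long nowMs) {
        return wordTimestampMs + WINDOW_MS - nowMs;
    }
}
```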
- Each “Frequency-api” instance consumes the Kafka topics with a different groupId. This way every instance holds a full replica of the data, which provides high availability.
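The replication trick is just consumer-group configuration: a sketch of building those settings, with the config keys written out literally so the snippet does not depend on the Kafka client jar (server address and group-name prefix are assumptions):

```java
import java.util.Properties;

class ConsumerSettings {
    static Properties forInstance(String instanceId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // A unique group.id per instance means every instance consumes every
        // word, so each one holds a full in-memory replica of the counters.
        props.put("group.id", "frequency-api-" + instanceId);
        props.put("auto.offset.reset", "latest");
        return props;
    }
}
```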
- The frontend polls the top K words every 10 seconds. In addition, it consumes the word stream over a WebSocket connection. I sampled the WebSocket stream down by 90% to keep the words readable in real time (otherwise they slide by extremely fast).
Metrics
Some metrics about this demonstration:
- Rate of the tweet exporter is 15 tweets/second (English only)
- Rate of the term exporter is 80 words/second
- On average, 15,000 words exist in the counter system (API memory) within the 10-minute interval.
Conclusion
I have tried to demonstrate a simple approach to this famous “top K frequent items” problem. The next questions should be:
- What if my sliding data doesn’t fit into memory?
- How do we calculate intervals larger than 10 minutes, for example the last hour or the last day?
- How to scale if the data-stream contains massive amounts of data?
The answers to these questions are good candidates for a new article.
You can check out the project source code from: https://github.com/efekaptan/top-k
I would appreciate it if you could share your thoughts about this demonstration. Your feedback is always welcome. Thanks for reading.