Top K Frequent Items in a Time-sliding Window
TL;DR
I have demonstrated a “top K frequent items” algorithm on the Twitter stream. You can check out the project source code at: https://github.com/efekaptan/top-k
Introduction
Finding the top elements of a finite set or a live stream is a straightforward process. Things become challenging, however, when the calculation depends on a sliding time interval. Finding the “top K frequent words from the Twitter stream for the last 10 minutes” is a good example of such a problem.
Can we find an algorithm that provides fast insertion (O(1)), fast deletion (O(1)) and fast selection (O(1)) within a data stream? At first I decided to use a heap data structure for fast insertion and selection, but realised that heaps are not suitable for frequent delete operations: each deletion from a heap requires a “heapify” step with O(log n) cost.
I did some research on the topic, dug into a couple of data-stream algorithms and came up with a solution that performs all three operations efficiently. I assume that the sliding part of the data fits into memory.
Algorithm
The algorithm is rather simple: a combination of two linked lists and a hash table. Before diving into the details, let’s look at a visualisation of the algorithm:
Each count node (red) has a head pointer to a word node. Words with the same frequency form a doubly linked list, and each word node also points back to its count node. Whenever we receive a new word from the stream, we first check the hash table to find the node reference; the node then moves to the next higher count node, which gives us fast insertion.
The delete operation works in a similar way: the word arrives on the “deletion” stream and its node moves to the next lower count node.
Selecting the top K items is straightforward: start from the highest-frequency count node, take its head word node and keep moving to the next one.
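The structure above can be sketched as follows. This is a minimal illustration, not the project’s actual code (class and method names are mine): the article links word nodes into per-count doubly linked lists, while this sketch keeps a LinkedHashSet of words per count bucket for brevity, which preserves the same O(1) behaviour per operation.

```java
import java.util.*;

// Count buckets form a doubly linked list ordered by ascending count;
// a hash table maps each word to its current bucket.
class FrequencyBuckets {
    private static final class CountNode {
        final int count;
        final LinkedHashSet<String> words = new LinkedHashSet<>();
        CountNode prev, next;
        CountNode(int count) { this.count = count; }
    }

    private final CountNode head = new CountNode(0); // sentinel bucket
    private CountNode tail = head;                   // highest-count bucket
    private final Map<String, CountNode> index = new HashMap<>();

    // Insertion stream: move the word one bucket up. O(1).
    public void increment(String word) {
        CountNode cur = index.getOrDefault(word, head);
        CountNode next = cur.next;
        if (next == null || next.count != cur.count + 1) {
            next = insertAfter(cur, new CountNode(cur.count + 1));
        }
        next.words.add(word);
        index.put(word, next);
        if (cur != head) removeWord(cur, word);
    }

    // Deletion stream: move the word one bucket down. O(1).
    public void decrement(String word) {
        CountNode cur = index.get(word);
        if (cur == null) return;
        if (cur.count > 1) {
            CountNode target = cur.prev;
            if (target.count != cur.count - 1) {
                target = insertAfter(cur.prev, new CountNode(cur.count - 1));
            }
            target.words.add(word);
            index.put(word, target);
        } else {
            index.remove(word);
        }
        removeWord(cur, word);
    }

    // Selection: walk the buckets from the highest count downwards. O(k).
    public List<String> topK(int k) {
        List<String> out = new ArrayList<>();
        for (CountNode n = tail; n != head && out.size() < k; n = n.prev) {
            for (String w : n.words) {
                if (out.size() == k) break;
                out.add(w);
            }
        }
        return out;
    }

    private CountNode insertAfter(CountNode node, CountNode fresh) {
        fresh.prev = node;
        fresh.next = node.next;
        if (node.next != null) node.next.prev = fresh; else tail = fresh;
        node.next = fresh;
        return fresh;
    }

    private void removeWord(CountNode node, String word) {
        node.words.remove(word);
        if (node.words.isEmpty()) {  // unlink the now-empty bucket
            node.prev.next = node.next;
            if (node.next != null) node.next.prev = node.prev; else tail = node.prev;
        }
    }
}
```

Because a word only ever moves to an adjacent bucket, every increment and decrement touches at most two buckets, which is where the O(1) guarantee comes from.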
Implementation
I used Java for backend services and React for the frontend application. Kafka was chosen as the messaging system.
The main components of the system are:
- Tweet Exporter (which subscribes to the Twitter stream and sends tweets to Kafka)
- Term Exporter (which receives tweets from Kafka and tokenizes them into words)
- Frequency-api (which receives the words from Kafka and serves the frequency result)
- Frontend (which visualises the result)
A couple of implementation notes:
- I used the official Twitter API for the tweet stream. The stream was filtered to receive English tweets only; the system can be extended with another exporter (producer) for a different language.
- The Term Exporter does basic English tokenization on each tweet. It removes stop words, URLs, emojis, “RT”, mentions and non-alphanumeric characters.
String tweet = "RT @gorillaz: Meanwhile, at Kong Studios... \uD83E";
List<String> expected = Arrays.asList("kong", "studios");
Assert.assertEquals(expected, getTokens(tweet));
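A minimal sketch of a getTokens implementation that satisfies the assertion above (the real Term Exporter uses a fuller stop-word list; the small set and regexes here are illustrative only):

```java
import java.util.*;
import java.util.regex.Pattern;

class Tokenizer {
    // Illustrative sample; a real tokenizer would load a full stop-word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("rt", "at", "the", "a", "an", "and", "meanwhile"));
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    static List<String> getTokens(String tweet) {
        String cleaned = URL.matcher(tweet).replaceAll(" ");     // strip URLs
        cleaned = cleaned.replaceAll("@\\w+", " ");              // strip mentions
        cleaned = cleaned.replaceAll("[^\\p{Alnum} ]", " ");     // strip emojis, punctuation
        List<String> tokens = new ArrayList<>();
        for (String t : cleaned.toLowerCase(Locale.ENGLISH).split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }
}
```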
- The words are queued into two Kafka topics: one for insertion and one for deletion. Each word carries a timestamp so the 10-minute interval can be checked before deletion.
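The deletion-side check reduces to simple clock arithmetic. A sketch, assuming each word record carries its original ingestion timestamp in epoch milliseconds (names are illustrative):

```java
class EvictionClock {
    static final long WINDOW_MS = 10 * 60 * 1000;   // the 10-minute sliding window

    // Milliseconds the deletion consumer should still wait before decrementing
    // this word; zero or negative means the word has left the window.
    static long remainingMs(long wordTimestampMs, long nowMs) {
        return wordTimestampMs + WINDOW_MS - nowMs;
    }
}
```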
- Each “Frequency-api” instance consumes the Kafka topics with a different groupId. This way every instance holds a full replica of the data, which provides high availability.
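The replication trick is just consumer-group configuration: a sketch of building those settings, with the config keys written out literally so the snippet does not depend on the Kafka client jar (server address and group-name prefix are assumptions):

```java
import java.util.Properties;

class ConsumerSettings {
    static Properties forInstance(String instanceId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // A unique group.id per instance means every instance consumes every
        // word, so each one holds a full in-memory replica of the counters.
        props.put("group.id", "frequency-api-" + instanceId);
        props.put("auto.offset.reset", "latest");
        return props;
    }
}
```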
- The frontend polls the top K words every 10 seconds. In addition, it consumes the word stream over a WebSocket connection. I sampled the WebSocket stream down by 90% to keep the words readable in real time (otherwise they slide by extremely fast).
Metrics
Some metrics about this demonstration:
- Rate of the tweet exporter is 15 tweets/second (English only)
- Rate of the term exporter is 80 words/second
- On average, 15,000 words exist in the counter system (API memory) within the 10-minute interval.
Conclusion
I have tried to demonstrate a simple approach to this famous “top K frequent items” problem. The next questions should be:
- What if my sliding data doesn’t fit into memory?
- How do we calculate intervals larger than 10 minutes, for example the last hour or the last day?
- How to scale if the data-stream contains massive amounts of data?
The answers to these questions are good candidates for a new article.
You can check out the project source code from: https://github.com/efekaptan/top-k
I would appreciate it if you could share your thoughts about this demonstration. Your feedback is always welcome. Thanks for reading.