Real-time processing of tweet hashtags with Faust

Ignacio Peluffo
Jul 3, 2019 · 4 min read

What is Faust?

Faust is described in a short and simple way by its creators as:

Faust is a stream processing library, porting the ideas from Kafka Streams to Python.

And the best part:

It does not use a DSL, it’s just Python! This means you can use all your favorite Python libraries when stream processing: NumPy, PyTorch, Pandas, NLTK, Django, Flask, SQLAlchemy, ++

Robinhood, the company behind Faust, uses it in production and has open-sourced the library, making its repo public on GitHub: https://github.com/robinhood/faust.

Faust is an incredibly easy and powerful tool for processing streams in real time, allowing you to process millions or even billions of events per day thanks to its implementation on top of Python's asyncio and Apache Kafka as the message broker.

In this post, we'll build a simple app that processes tweets in real time from a Twitter stream, filtering them by a specific criterion, in order to count the hashtags found in them. The objective is to show how quick and easy it is to build a stream processing project written purely in Python.

What are we building?

The idea of this project is simple: read tweets from a Twitter stream, send their hashtags as events and count them using Faust agents.

What are we going to do? Build the following:

  1. Custom Faust CLI command
  2. Faust agent to process events
  3. Faust views

Here is a simple diagram that shows how the different components interact with each other:

Note: the purpose of this article is not to explain the Faust library and its components in detail, but to give an idea of how easy it is to build a stream processor with it.

Custom Faust CLI command

Faust provides very handy tools for building our own custom CLI commands with parameters.

For this project, we're going to build a CLI command that reads tweets from the Twitter real-time stream, processes them, and generates one event per hashtag found in them.

Since Faust uses asyncio and the async/await syntax, we need a Twitter client that supports the same. For this project I've chosen peony-twitter, an asynchronous Twitter client written in Python.

Below is the command's code, with the logic split into functions for better readability and to make the idea simpler to understand.
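A minimal sketch of how the command could look is shown here; the get_twitter_client and extract_hashtags helpers, the broker URL, and the credential placeholders are assumptions for illustration:

import faust
from faust.cli import option
from peony import PeonyClient

app = faust.App('hashtags-counter', broker='kafka://localhost:9092')

# Topic carrying one hashtag per event; key and value are plain strings.
hashtags_topic = app.topic('hashtags', key_type=str, value_type=str)


def get_twitter_client() -> PeonyClient:
    # Placeholder credentials; in a real project these would come from
    # environment variables or a settings module.
    return PeonyClient(
        consumer_key='<consumer-key>',
        consumer_secret='<consumer-secret>',
        access_token='<access-token>',
        access_token_secret='<access-token-secret>',
    )


def extract_hashtags(tweet):
    # Hashtags live under entities.hashtags in the Twitter payload.
    return [hashtag.text.lower() for hashtag in tweet.entities.hashtags]


@app.command(
    option('--track', type=str, default='#python',
           help='Criteria used to filter the Twitter stream.'),
)
async def hashtags_events_generator(self, track: str) -> None:
    """Read the filtered Twitter stream and send one event per hashtag."""
    client = get_twitter_client()
    async with client.stream.statuses.filter.post(track=track) as stream:
        async for tweet in stream:
            if 'entities' in tweet:  # skip keep-alives and stream events
                for hashtag in extract_hashtags(tweet):
                    # Keying by the hashtag keeps counting consistent when
                    # the table is sharded across several workers.
                    await hashtags_topic.send(key=hashtag, value=hashtag)

Assuming the module is saved as hashtags_counter.py, the command would be run with something like faust -A hashtags_counter hashtags_events_generator --track="#python".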

The command hashtags_events_generator is responsible for filtering the Twitter stream using the value of the track parameter passed to it.

Then, it processes all the tweets that match the filter and sends one event per hashtag found. This point is very important since Faust currently uses Kafka as the main broker for sending messages.

Faust agent to process events

The second step is to build a Faust agent, which is one of the most important concepts in the framework, to process the Kafka events and count the hashtags.

Also, for this implementation, we use Faust tables, another crucial concept that provides great flexibility to store data and share it across multiple workers running in parallel. I encourage you to read the documentation to understand all the details and potential of Tables.

A code snippet that shows how the agent could be implemented is shown below:
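This is a minimal sketch, reusing the app and hashtags_topic defined for the CLI command above; the agent name count_hashtags is a placeholder, while the table name follows the one mentioned later in the article:

# The table defaults to int (i.e. 0), so we never have to check whether
# a hashtag already exists in it before incrementing its count.
hashtags_counts_table = app.Table('hashtags_counts', default=int)


@app.agent(hashtags_topic)
async def count_hashtags(hashtags):
    # Iterate over the stream of hashtag events; the event body is the hashtag.
    async for hashtag in hashtags:
        hashtags_counts_table[hashtag] += 1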

This piece of code includes the agent, topic, and table definitions. In a real project, these should be defined in separate modules for better code organisation.

From this code it's easy to see how to build the Kafka topic consumer, which requires the following:

  1. Use a decorator that specifies the topic the agent should consume from
  2. Iterate over the stream (i.e. hashtags)
  3. Process each event; in this case, count the hashtag (the event's body) and store the total count in the hashtags_counts_table table

The best part is that this scales really easily by running multiple workers at the same time, and it took only a few lines of code.

It’s important to note two details that make the implementation simpler:

  • The table has a default value, so we don't need to check whether a key already exists in it
  • We set a function to automatically parse the value of the messages received from the topic

Faust views

Up to this point, we've been sending events and processing them, but we couldn't get any information out, such as which hashtags the agent has processed.

For this reason, we build a tiny API that exposes two endpoints using Faust views, another useful tool provided by Faust for building APIs in a few lines of code that can be helpful for some specific use cases (a sketch follows the list below):

  1. /hashtags : returns all hashtags processed at the moment
  2. /hashtags/{hashtag}/count : returns the current count for the hashtag sent in the path parameter
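A sketch of these views using Faust's function-based pages could look like the following; the view function names are hypothetical, and @app.table_route is the documented way to forward a request to the worker that owns the table partition for the given key:

@app.page('/hashtags/')
async def list_hashtags(web, request):
    # Lists the hashtags stored in this worker's shard of the table.
    return web.json({'hashtags': list(hashtags_counts_table.keys())})


@app.page('/hashtags/{hashtag}/count/')
@app.table_route(table=hashtags_counts_table, match_info='hashtag')
async def get_hashtag_count(web, request, hashtag):
    # table_route routes the request to the worker owning this key's partition.
    return web.json({'hashtag': hashtag,
                     'count': hashtags_counts_table[hashtag]})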

Example response from /hashtags :

{
  "hashtags": [
    "python",
    "ilovepythonmore",
    "programming"
  ]
}

Example response from /hashtags/python/count :

{
  "hashtag": "python",
  "count": 7
}

Demo

Below is a screen capture showing workers processing events in real time:

Source code

You can find the source code for this project in the GitHub repo: https://github.com/ipeluffo/faust-hashtags-counter

In the repo's README you'll find all the instructions to run the workers and set up the whole environment. I've also provided two docker-compose files so you can run Kafka with ZooKeeper locally.

Conclusions

As mentioned at the beginning of this article, Faust has a lot of potential, providing a rich set of tools for Python developers to build real-time stream processing applications with a clean and simple API.

One of the best things is that it's really easy to integrate Faust with well-known tools such as Django or Flask.

I hope you enjoyed the article and thanks for reading.
