Stream Processing with Faust

Ryan Whitten
The Pixel

--

Here at Pixability, we deal with lots of data every day. In our 11+ year quest to help brands and advertisers succeed on YouTube, one of our largest data sets is YouTube video and channel statistics. Our data science team is also constantly building new artificial intelligence (AI) and machine learning (ML) models to analyze these data in new ways to ultimately help advertisers realize their full potential across YouTube, Facebook, Instagram, and Connected TV.

With an increasing need to run our video data through these data science models and later join the new output data with our existing set, we sought a flexible and scalable solution to process millions of videos and channels every day. To do this, we developed a stream processing pipeline that lets us quickly build and deploy new models to production, often in a matter of hours.

Choosing a Stream Processing Framework

We use Apache Kafka as a message broker to shuttle harvested data to different places, usually ending up in our data warehouse. Our YouTube data is no different — in its simplest form we harvest data from the YouTube API, publish it to Kafka, and store it in our warehouse.
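
In outline, the harvest step is just a Kafka producer. Here is a minimal sketch using the kafka-python client, with a hypothetical topic name and payload fields (the real harvester pulls these records from the YouTube API):

```python
import json
from kafka import KafkaProducer

# A minimal sketch of the harvest step, assuming a local broker and a
# hypothetical topic name; the payload fields are illustrative only.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# In practice this record would come from the YouTube API.
video_stats = {'video_id': 'abc123', 'views': 1000, 'likes': 50}
producer.send('youtube.video.stats', video_stats)
producer.flush()
```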

For one specific project, where we needed to quickly build something that could more deeply categorize our video dataset, we created a collection of stream processing applications to intercept the video data before it reached the warehouse, run it through the categorization models, and publish the results to an entirely different set of Kafka topics.

Since we are mostly a Python shop, we decided to try the relatively new Faust framework to do the heavy lifting. We had experimented with stream processing before using Apache Spark, but that seemed like overkill for this project. We’ve also used our own DIY Python Kafka consumer scripts in the past, but they lack robust fault tolerance and horizontal scalability. Faust seemed like a great choice: it is natively Python, which spared our team from learning a new language and environment, and because it is modeled closely after Kafka Streams, it has many more features we may want to use in the future. It was also developed by a team at Robinhood, including the creator of the popular Celery project, which gave us confidence that Faust could be used in a production environment.

Getting Up and Running

After we decided to explore Faust, we were pleased to find that it’s incredibly easy to set up and use. At its core, Faust has all of the built-in functionality to connect to a Kafka source topic, start consuming messages (including options for windowing), and publish data to new (or existing) topics.
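
A minimal Faust app shows how little boilerplate that takes. Everything here beyond the Faust API itself (the broker address, topic names, and record fields) is a hypothetical stand-in:

```python
import faust

app = faust.App('video-stats', broker='kafka://localhost:9092')

# A hypothetical record schema for incoming video stats.
class Video(faust.Record):
    video_id: str
    views: int

source_topic = app.topic('youtube.video.stats', value_type=Video)
output_topic = app.topic('youtube.video.stats.processed')

@app.agent(source_topic)
async def process(videos):
    # Consume each message from the source topic and republish it
    # (after any transformation) to the output topic.
    async for video in videos:
        await output_topic.send(value=video)

if __name__ == '__main__':
    app.main()
```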

Knowing that we had many different AI models we needed to start running videos through, we worked with our data science team to create one generic scoring function that can process either channels or videos and takes in the desired model file.
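
A sketch of what such a function might look like, assuming pickled scikit-learn-style models stored in S3; the bucket, key, and feature names are hypothetical:

```python
import pickle
import boto3

s3 = boto3.client('s3')

def load_model(bucket, key):
    # Fetch and deserialize a model file from S3 (hypothetical layout).
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    return pickle.loads(body)

def score(records, model):
    # Generic scoring: works for channels or videos alike, as long as
    # each record carries the features the model expects.
    features = [[r['views'], r['likes']] for r in records]  # hypothetical features
    predictions = model.predict(features)
    # Attach the model output to each record and hand the batch back.
    return [dict(record, category=prediction)
            for record, prediction in zip(records, predictions)]
```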

We followed a similar approach when building Faust “apps”. We defined a single app that receives a variable telling it which “model class” to load. We used basic Python object-oriented programming to define classes that hold all of the settings for a particular model, including the source topic, the output topic, the path to the model file (in Amazon S3), and any other settings the generic scoring function needs. The basic process (as shown below) is: consume messages from the source topic, pass a batch of windowed records to the generic scoring function, publish the returned results to the Kafka output topic, and finally load those records into our data warehouse using our standard Kafka sink connector.
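
A condensed sketch of that wiring, with hypothetical topic names, S3 locations, and batch sizes, reusing the load_model and score helpers from the previous sketch:

```python
from datetime import timedelta
import faust

app = faust.App('video-categorizer', broker='kafka://localhost:9092')

class VideoCategoryModel:
    # A hypothetical "model class": one place for all of a model's settings.
    source_topic = 'youtube.video.stats'
    output_topic = 'youtube.video.categories'
    model_bucket = 'our-models'        # hypothetical S3 location
    model_key = 'video-categorizer.pkl'

settings = VideoCategoryModel()
# load_model and score come from the generic scoring sketch above.
model = load_model(settings.model_bucket, settings.model_key)
source = app.topic(settings.source_topic)
sink = app.topic(settings.output_topic)

@app.agent(source)
async def categorize(stream):
    # Wait for up to 500 records or 10 seconds, whichever comes first,
    # score the whole batch at once, then publish each result.
    async for batch in stream.take(500, within=timedelta(seconds=10)):
        for result in score(batch, model):
            await sink.send(value=result)
```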

Moving Forward

After the initial development time to get this project off the ground, we can now deploy new models to production in as little as a few hours (including time for testing). Given the ease of development with Faust on this project, we have decided to use it for another project, this time much larger in scale. If you are looking to get into stream processing and want to stay within the Python ecosystem, look no further than Faust. You can also find a stripped-down example of our project here.
