Continuous NLP Pipelines with Python, Java, and Apache Kafka

Victor Künstler
bakdata
Jul 6, 2020 · 12 min read


Advancements in machine learning, data analytics, and IoT, together with businesses' strategic shift towards real-time, data-driven decision making, increase the demand for stream processing. Apache Kafka and Kafka Streams are rising in popularity as a solution for building streaming data processing platforms.

Natural language processing (NLP) can retrieve valuable information from texts and is a typical task tackled on such platforms.

TL;DR We implemented common base functionality to create production NLP pipelines. We use Avro, support large messages, and handle errors with modern monitoring, combining powerful Python libraries with Java. Read on to learn about the technical foundation we created and the libraries we share.

Because Kafka Streams, the most popular client library for Kafka, is developed for Java, many applications in Kafka pipelines are written in Java. However, several popular NLP libraries, such as spaCy, NLTK, and Gensim, are developed in Python and backed by large communities. To achieve excellent results, it is therefore desirable to use Python and these NLP libraries as part of streaming data pipelines. Consequently, streaming applications written in Java and Python must work together seamlessly in a modern streaming data pipeline.
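For illustration, here is a minimal sketch of what plugging a Python NLP library into a Kafka pipeline can look like: a plain consumer based on the confluent-kafka package that runs spaCy over incoming text messages. The topic name, group id, and model are placeholders, and none of the common base functionality discussed below is used yet.

```python
import spacy
from confluent_kafka import Consumer

# Placeholder model; assumes the small English spaCy model is installed.
nlp = spacy.load("en_core_web_sm")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "nlp-demo",            # placeholder group id
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["texts"])          # placeholder topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # Run the NLP model over the raw text message ...
        doc = nlp(msg.value().decode("utf-8"))
        # ... and, for example, extract named entities.
        print([(ent.text, ent.label_) for ent in doc.ents])
finally:
    consumer.close()
```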

Combining Python and Java for streaming applications implies that the applications are decoupled, which makes deployment, error handling, and monitoring more complex. Moreover, the data that applications consume and produce must be consistent and (de-)serializable in a language-agnostic way.
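As a sketch of what language-agnostic serialization can look like, the snippet below (with a hypothetical schema, topic, and registry URL) serializes a record with Avro from Python using Confluent Schema Registry. Because the serialized bytes reference the schema id in the registry, a Java application can deserialize the same records.

```python
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Hypothetical Avro schema shared by the Python and Java applications.
schema_str = """
{
  "type": "record",
  "name": "TextDocument",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "text", "type": "string"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})  # placeholder URL
serialize = AvroSerializer(registry, schema_str)

# The resulting bytes carry the registered schema id, so any Kafka client with
# access to the registry -- written in Java or Python -- can deserialize them.
payload = serialize(
    {"id": "42", "text": "Kafka pipelines meet NLP."},
    SerializationContext("texts", MessageField.VALUE),
)
```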

In this blog post, we illustrate how we laid the foundation for developing NLP pipelines with Python, Java, and Apache Kafka. We discuss the challenges mentioned above and showcase common utility functions and base classes for developing Python and Java streaming applications that work together smoothly. We cover the following topics:

  • Developing, configuring, and deploying Kafka applications written in Python and Java on Kubernetes
  • Using Avro for serialization in Java and Python streaming applications
  • Managing errors in a common way using standardized dead letters