Building a data engineering project. Part 3 — Kafka, containers, and portable software

Kirill Zaitsev
7 min read · Feb 19, 2022



From Part 2, we have our system up and running locally. The solution is functional, so we should be satisfied.

Wait. Remember the burden of setting up your Java version properly for the first time? And Python? And Kafka? What about other developers who want to work with our system?

To make the project maintainable and reusable by others, its launch process should be automated as much as possible. Containers are foundational to achieving this goal and are vital for a microservice application. In this part of the series, I suggest taking a closer look at this technology while embedding another crucial part of the pipeline — the ingestion buffer.

With this in mind, I suggest the following plan:

  1. Real-time ingestion
  2. Kafka and its place in our pipeline
  3. Containers, Docker, and Docker Compose
  4. Containerized Kafka cluster
  5. Ready to deploy!
  6. Project summary and takeaways


Real-time ingestion

First of all, what does real-time mean? Wikipedia (https://en.wikipedia.org/wiki/Real-time_data) puts it as information that is 'immediately delivered after its creation.' Does data really reach its destination instantaneously? Should raw data also be preprocessed before getting into a database? And what if there is a lot of real-time data?

The answer to the first question lies in the latency numbers 'every programmer should know' — https://gist.github.com/jboner/2841832. The remaining questions we will handle gradually in this article.

For a primer on real-time systems, I encourage you to read this article by Tyler Akidau: https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/.

In our case, we are streaming tweets in real time using the Twitter API. The implementation of our small processing engine doesn't 'remember' how it arrived at the current state of the inverted index. In other words, we don't log the operations that happen. If something fails, we start from scratch! Not good if fault tolerance is a requirement. This is where a message buffer becomes useful: we simply store our data there and can replay everything on demand.
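To make this concrete, here is a minimal sketch of what handing tweets to the buffer could look like. It assumes a Python producer built with the kafka-python library, the broker on localhost:9092, and the inverted_index_app topic that our Compose file (shown later) creates; the project's actual producer code may look different.

import json
from kafka import KafkaProducer  # pip install kafka-python

# Connect to the broker that the Compose setup publishes on localhost:9092.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def buffer_tweet(tweet: dict) -> None:
    # Instead of indexing the tweet right away, append it to the durable log.
    # If the indexer crashes, the raw tweets stay in Kafka and can be replayed.
    producer.send("inverted_index_app", value=tweet)

buffer_tweet({"author": "someone", "text": "an example tweet"})
producer.flush()  # make sure buffered messages actually reach the broker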

For a more in-depth explanation of why decoupling ingestion from processing is mandatory in the streaming case, do familiarize yourself with this explanatory article.

Kafka

There are a ton of message-buffering solutions, yet one seems to be ahead of the pack — Kafka. It is a highly scalable, fault-tolerant, and performant messaging system built on top of the Pub/Sub paradigm and widely adopted in the big data industry.

The notion of offsets and consumer groups will help us achieve the replay functionality we need to guarantee that no tweets are lost if our system breaks (for example, the server gets stuck because of ~100% RAM utilization). Plus, Kafka is a natural choice for applications of any scale. And it's relatively straightforward to add it as part of our microservice app.
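To illustrate how offsets make the replay possible, here is a hedged sketch of a consumer that rewinds to the very beginning of the topic and rebuilds its state from there. It again assumes kafka-python; the group name and the rebuild_index_with function are purely illustrative, not the project's actual code.

from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

# Join a consumer group for the indexing service; offsets are managed manually.
consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="inverted-index-builder",  # illustrative group name
    enable_auto_commit=False,
)

# Assign every partition of the topic and rewind them all to offset 0.
partitions = [
    TopicPartition("inverted_index_app", p)
    for p in consumer.partitions_for_topic("inverted_index_app")
]
consumer.assign(partitions)
consumer.seek_to_beginning(*partitions)

# Replaying the whole log lets us rebuild the inverted index after a crash.
for record in consumer:
    rebuild_index_with(record.value)  # hypothetical indexing function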

Initially, Kafka powered streaming use cases in the Hadoop ecosystem. Today we are free to set up a cluster of our own and play with the tool locally, which is exactly what we will do.

Containers, Docker, and Docker Compose

Bash scripts that get things up and running on the spot are a commonly adopted approach — commonly adopted for prototyping on your local machine, that is. There, you face no package conflicts, permissions are all set up, and the machine's resources are at your disposal. In production, it is something of a miracle to get this right from the start: the environment is shared by multiple applications, sensitive data must not be accessible to arbitrary software, new servers cannot be added easily so existing resources have to be used sparingly, and so on.

Containers, similar to virtual machines (read up on the difference), are a means of isolating software from the bare metal. Isolation gives you a certain level of control over details such as the right environment for your software, permission management, and resource utilization.

Docker is an engine that enables us to create and execute containers. You can read more about its alternatives here. Now, imagine we have our containerized services: our primary app, the tokenizer service, and the Kafka cluster. Should we execute docker run -d <image name> a couple of times?

We might, but there is a better way. The approach we will employ is called container orchestration. Docker Compose is our companion here for a couple of reasons:

  • there is a single machine available (not a cluster of servers — see Kubernetes, Docker Swarm);
  • we want a container orchestration solution with minimal management overheads.

docker-compose should be easy to set up if you have the Docker engine already running (see the installation guide). At this point, we are in the realm of multiple isolated containers running our services on a local machine.
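A quick sanity check before moving on: docker --version and docker-compose --version should both respond without errors in your terminal.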

Containerized Kafka cluster

Omitting preludes, let's jump right into the service definitions in the docker-compose.yaml file for a minimalistic Kafka cluster with our custom Kafka producer:

version: "3"
services:

  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"

  kafka:
    image: wurstmeister/kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: localhost
      KAFKA_CREATE_TOPICS: "inverted_index_app:1:1"
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  kafka_producer:
    image: kafka_producer:latest
    build:
      context: .
      dockerfile: kafka_producer.Dockerfile
    ports:
      - "9092:9092"
    network_mode: host
    depends_on:
      - kafka

Leaving out the basics (make sure to have a Compose file reference nearby), I draw your attention to the following:

  • Missing Docker images. Compose won't break if some of the images you specified are missing from your local Docker image cache — it pulls them for you.
  • Networking. By default, containers run in their own dedicated network, which you can switch to the host's network by specifying network_mode: host in the YAML. Alternatively, we can expose (“map”) just a single port from the container's network: for example, by telling the Zookeeper service to map 2181:2181, we make the software that occupies port 2181 inside the container accessible via port 2181 on localhost.
  • Service interdependencies. Thanks to the depends_on statement, kafka_producer will not start before the kafka container has started (note that depends_on only orders startup; it does not wait for Kafka itself to be ready).
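With these definitions in place, a single docker-compose up -d (or docker compose up -d on newer Docker releases) should build the missing images and start all three containers in the background.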

Running the Compose definition above successfully should give you a picture similar to the one below:

Minimalistic Kafka cluster with a custom producer
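If you want to check that the broker is actually reachable from the host before wiring up the rest, a tiny Python probe (again assuming kafka-python) might look like this:

from kafka import KafkaConsumer

# Connect to the broker published on localhost:9092 and list the topics it knows about.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())  # should include 'inverted_index_app' once the cluster is healthy
consumer.close()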

Ready to deploy!

The rest of the docker-compose.yaml looks no more challenging than what we’ve already seen:

  tokenizer:
    image: tokenizer:latest
    build:
      context: .
      dockerfile: tokenizer.Dockerfile
    network_mode: "host"

  main:
    image: app_server:latest
    build:
      context: .
      dockerfile: app_server.Dockerfile
    network_mode: "host"
    volumes:
      - ./vocab/:/vocab
      - ./datasets_v2/:/datasets_v2

Note the slight abuse of network_mode: “host”. As we have discussed, this makes a service run directly on the host's network. It comes in handy when there would otherwise be too many port mappings to provision. It is not a best practice, though: other software may be running on the same network, which can lead to port conflicts or unauthorized access to local endpoints. Limit reliance on the host network as much as possible in production (see this discussion).

If you have kept up and have the above services running, we are ready for something big. Run the following command in the terminal: docker run -it --network host client:latest. In a few seconds the application should be ready.

Now you can type the words that you would like to capture from real-time tweets:

war
[dataset_v2/Brhane/tweet_0.txt, dataset_v2/barti/tweet_5.txt, dataset_v2/Paul Viner/tweet_6.txt, dataset_v2/Flaubert/tweet_15.txt, dataset_v2/Marina./tweet_24.txt, dataset_v2/Joe S SDK/tweet_29.txt, dataset_v2/Joe Smith SDK/tweet_29.txt, dataset_v2/trackntrace/tweet_33.txt, dataset_v2/Hungry Keto Chef/tweet_34.txt, dataset_v2/vvictory/tweet_35.txt, dataset_v2/NO NUT ever AGAIN/tweet_39.txt]
bitcoin
[]
nuclear
[dataset_v2/EddieN/tweet_54.txt]
television
[]
president
[]
economics
[dataset_v2/NancyH/tweet_3.txt, dataset_v2/Ncontsi/tweet_22.txt, dataset_v2/Hungry Keto Chef/tweet_34.txt, dataset_v2/elke🇺🇸/tweet_38.txt, dataset_v2/Death Punch princess/tweet_40.txt, dataset_v2/Hellen/tweet_49.txt]

Project summary and takeaways

So, we have our data engineering project. Let’s do a quick review of what was achieved:

  • In Part 1, following the idea of building a tool to monitor search activity in the wild, we researched the inverted index approach;
  • we designed a microservice architecture, with each service covering a particular part of the app's functionality, following some of the system design best practices;
  • In Part 2, anticipating heavy CPU utilization, we optimized the indexing part with concurrency and explored compare-and-swap (CAS) — one of the best approaches to reliable parallelized computing;
  • we built a client-server app that can serve several clients at once, and tested our prototype by setting up the whole infrastructure locally;
  • In this part, we aimed at making our solution portable to other environments. We applied some basic concepts of microservice applications: made a container for each service, automated container builds, added the necessary dependencies to make the code reusable by others, and isolated each piece of software in the container that hosts it;
  • we took a brief look at the challenges that come with streaming applications and explored the specific properties of message buffers that make them essential for real-time processing;
  • using Kafka as our message bus, we integrated it as another microservice in our app;
  • finally, we launched our services with a Docker Compose definition file and experimented with the application from the client's perspective.

I challenge you to think about how the system could be improved further. For example, maybe you spotted how we hard-coded 'raw' ports and configs in the Compose definition? For the 'database' of tweets, we simply put all of them in a folder on the local file system. Why not incorporate a NoSQL database, which is a natural choice for real-time applications?

With this, I invite you to the next part about CI/CD pipelines: Part 4 — exploring the CI/CD landscape.

Please refer to the complete codebase for more details.
