Kafka, proxies and service meshes

We have arguably always had distributed applications, but today, with the tendency to limit the size of services (i.e. the move towards microservices), they are even more widespread. On the projects I work on, it’s quite common for some services to publish REST APIs, e.g. supporting synchronous calls from Angular applications, while others work asynchronously on background tasks, e.g. services consuming from and producing to Kafka topics. The applications typically run on a container platform like Kubernetes (or its close relative OpenShift). With such a constellation, somewhere in the back of your mind is the question “Would a service mesh make my life easier or harder?”, quickly followed by “Service meshes are for HTTP traffic, right? Where does Kafka fit in?” Regardless of the answer to the first question, trying to answer the second one is interesting.

To begin with, what is a service mesh?

A service mesh is a dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable [1]

You can imagine your application as consisting of multiple services. In the worst case it’s a fully connected network, although if that is so you might ask yourself “Have I drawn my service boundaries correctly, or have I built a distributed monolith?” Regardless, we are now in a distributed world and each node is responsible for dealing with distributed problems.

Assuming your services are cohesive and independent, we can reap the benefit of allowing each development team to make their own decisions on choosing which stack would best serve their purposes. This enables them to deliver quality software quicker to the customer.

However, does this mean every single service has to reimplement its own tracing, metrics and logging libraries? You may counter “That’s why we use Spring Boot”, but does that help the .NET team? Is there somewhere we could place such common concerns and avoid solving them independently multiple times? In a proxy maybe?

Gwen Shapira [2] describes the proxy as the fundamental component of a service mesh. Proxies are omnipresent in our lives as application developers. Reverse proxies help with things like load balancing, authentication and firewall rules, while forward proxies can control access to external sites and perform TLS termination and deep packet inspection on corporate networks. In a service mesh all inbound and outbound traffic is routed through a proxy. Matt Klein from Lyft lists the architectural issues which motivated their creation of Envoy, a popular service mesh proxy [3]:

  • Multiple languages and frameworks.
  • Many protocols (HTTP/1, HTTP/2, gRPC, databases, caching, etc.).
  • Black box load balancers (AWS ELB).
  • Lack of consistent observability (stats, tracing, and logging).
  • Partial or no implementations of retry, circuit breaking, rate limiting, timeouts, and other distributed systems best practices.
  • Minimal authentication and authorization.
  • Per language libraries for service calls.
  • Extremely difficult to debug latency and failures.
  • Developers did not trust the microservice architecture.

The idea is that each service should only be concerned with implementing business functionality. The network should be transparent to the services, and when you have issues in production they should be easy to trace. Werner Vogels, AWS’s CTO, talks about undifferentiated heavy lifting. Taking this on is the goal of proxies in a service mesh: solving difficult distributed system problems in one place, so that the services themselves can concentrate on providing business value. Now we can modify our diagram above to include proxies for every service.

In Kubernetes, it is not enough to have a proxy container running in a pod alongside the application container. Such a setup is called a sidecar proxy, but on its own it will never see any traffic bound for the service. We need a mechanism to configure the pod’s network and divert traffic to the proxy. Kubernetes provides us with Init containers [5]. An Init container starts and runs to completion before your application container starts. It’s useful for setup work needed by your pod, for example applying routing rules using iptables (more details in the first example below).

The final piece of the puzzle is the control plane. Proxies live in what is called the data plane of a service mesh, along with your application containers. However, to gain runtime insight into the data plane and to configure what is happening in it, we need an additional plane, the control plane. To take a few examples, the control plane is responsible for things like certificate management (e.g. for mTLS), directing traffic (e.g. in canary testing) and collecting telemetry data. The combination of both planes gives you a service mesh.

Rolling your own Dataplane

Now that we have an understanding of what a service mesh is, how might we create our own data plane? Venil Noronha describes nicely [4] how you might craft your own sidecar proxy. Using Kubernetes as the platform you’ll need:

  • An Init Container where we can apply configuration scripts.
  • To configure the pod’s network we need an iptables rule. Below we redirect all traffic bound for port 8080 to port 8000:

iptables -t nat -A PREROUTING -p tcp -i eth0 --dport 8080 -j REDIRECT --to-port 8000

  • Finally you need a proxy listening on port 8000, and forwarding traffic to port 8080. In my example below the proxy is written in Go

Here we have an iptables rule which diverts packets through the proxy. The proxy checks each inbound HTTP request and decides whether or not to rewrite its HTTP body. The Spring Boot service is oblivious to the existence of the proxy. You can find more implementation details in the GitHub project below.
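To make the proxy side of this setup more concrete, here is a minimal sketch in Go. It is not the code from the project mentioned above: the rewrite rule (masking the word "secret"), the function names rewriteBody, newSidecarProxy and roundTrip, and the /greeting path are all made up for illustration. So that it runs without Kubernetes, an in-process echo server stands in for the application; in a pod you would serve the handler on port 8000 and point it at 127.0.0.1:8080 instead.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// rewriteBody is a hypothetical rewrite rule used only for illustration:
// it masks the word "secret" wherever it appears in the request body.
func rewriteBody(body []byte) []byte {
	return bytes.ReplaceAll(body, []byte("secret"), []byte("******"))
}

// newSidecarProxy returns a handler that rewrites each request body and then
// forwards the request to the application listening at target.
func newSidecarProxy(target *url.URL) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(target)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Body != nil {
			body, err := io.ReadAll(r.Body)
			if err != nil {
				http.Error(w, "cannot read request body", http.StatusBadRequest)
				return
			}
			body = rewriteBody(body)
			// Swap in the new body and keep the content length consistent with it.
			r.Body = io.NopCloser(bytes.NewReader(body))
			r.ContentLength = int64(len(body))
		}
		proxy.ServeHTTP(w, r)
	})
}

// roundTrip starts a stand-in application that echoes whatever it receives,
// sends one request through the proxy, and returns what the application saw.
func roundTrip(payload string) string {
	app := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(w, r.Body) // echo the (possibly rewritten) body back
	}))
	defer app.Close()

	target, err := url.Parse(app.URL)
	if err != nil {
		panic(err)
	}
	req := httptest.NewRequest(http.MethodPost, "/greeting", strings.NewReader(payload))
	rec := httptest.NewRecorder()
	newSidecarProxy(target).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	// In a pod you would instead serve the proxy on the redirected port, e.g.
	//   http.ListenAndServe(":8000", newSidecarProxy(target))
	// with target pointing at http://127.0.0.1:8080.
	fmt.Println(roundTrip("my secret message"))
}
```

The application only ever sees the rewritten body, which is the sense in which it stays oblivious to the proxy. Because the proxy forwards over the loopback interface, its own traffic to port 8080 is not caught by a redirect rule that matches only packets arriving on eth0.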

Before leaving HTTP traffic: if you’re interested in more details on Kubernetes networking, Kevin Sookocheff’s “A Guide to the Kubernetes Networking Model” [6] is a comprehensive read.

Kafka, “We’re not in Kansas anymore”

But Kafka[5] is not HTTP. Kafka stores messages (aka events or records) durably, scales really well, has streaming libraries for aggregating messages over windows, and has connector libraries for integrating with other sources and sinks such as databases. If you’re new to Kafka it’s worth taking a look at Apache’s quickstart, and Tim Berglund from Confluent gives a good high-level view of what a Kafka platform can offer in the video “What is Apache Kafka”.

Kafka uses a binary protocol over TCP consisting of request-response pairs. Examples of request types include Produce requests to publish messages and Fetch requests to consume messages. At the time of writing there are 48 message pairs (compared to HTTP’s nine request methods). Each message pair is independently versioned. Being a binary protocol helps with the efficient transfer of data between producers and consumers (Kafka clients) and the broker (a process in the Kafka cluster responsible for storing messages in its append-only distributed commit log).

However, the fact that it’s a rich binary protocol means more work parsing the requests and responses. It has not been around as long as HTTP, so while listening to and parsing HTTP traffic is a matter of importing a library (net/http in our Go proxy above), parsing Kafka requests requires a little more effort.

Work to date on Kafka Proxies

Various people have worked on this over the past few years. This is not an exhaustive list, but here’s what I’ve stumbled across. David Jacot [9] gave a really interesting talk at the Bern Streaming Meetup group demoing how you might implement GDPR compliance using Kafka proxies. Prior to that, Travis Jeffery wrote an implementation of Kafka and a proxy in Go [10]. Banzai Cloud took a slightly different angle [11] and ran a Kafka cluster on Istio. And finally, Adam Kotwasinski is currently busy implementing a Kafka filter for Envoy [12].

Rolling your own Kafka Proxy

To test out a Kafka proxy for myself, one option was to update the existing Go libraries to support the current version of Kafka (so reading the Kafka protocol documentation very closely :-) ). Another option came from the Apache Kafka [13] project itself, which defines the Kafka protocol with JSON schemas and autogenerates the associated Java objects, wrapping them all in the kafka-clients library. So while I tried the former, I ended up going with the latter in conjunction with Netty.

Here we have Go and .NET producers sending their message values in plaintext. On the other side we have a Spring Boot consumer consuming the messages, with their values again in plaintext. However, the values of the messages themselves are encrypted and decrypted in transit by the proxy. So while the Kafka logs contain encrypted data, the Kafka clients are again completely ignorant of the encryption mechanism. More implementation details can be found in the GitHub link below.
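The value transformation itself can be sketched independently of the Kafka plumbing. The example below is written in Go rather than the Java and Netty used for the actual proxy, and it assumes AES-GCM with a random nonce stored in front of the ciphertext. The cipher, the key handling and the function names are my assumptions for illustration, not a description of the proxy's real mechanism.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds an AES-GCM cipher from a 16, 24 or 32 byte key.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encryptValue is what a proxy could apply to a message value on the produce
// path. The result is the random nonce followed by the ciphertext.
func encryptValue(aead cipher.AEAD, plaintext []byte) ([]byte, error) {
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends the ciphertext to its first argument, here the nonce.
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptValue reverses encryptValue on the fetch path.
func decryptValue(aead cipher.AEAD, stored []byte) ([]byte, error) {
	if len(stored) < aead.NonceSize() {
		return nil, errors.New("stored value too short to contain a nonce")
	}
	nonce, ciphertext := stored[:aead.NonceSize()], stored[aead.NonceSize():]
	return aead.Open(nil, nonce, ciphertext, nil)
}

func main() {
	// Demo only: a throwaway random key. A real proxy would load its key
	// from a secret store.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	aead, err := newAEAD(key)
	if err != nil {
		panic(err)
	}
	stored, err := encryptValue(aead, []byte("Gloin"))
	if err != nil {
		panic(err)
	}
	plain, err := decryptValue(aead, stored)
	if err != nil {
		panic(err)
	}
	fmt.Printf("stored on broker: %x\nseen by consumer: %s\n", stored, plain)
}
```

With a scheme like this the broker only ever stores the nonce and ciphertext, while producers and consumers on either side of the proxy keep working with the plaintext value.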

Example of results:

Below we have both the .NET and Go producer logs. Tailing them, we see the message values printed in plaintext.

tail -f donet-producer.log go-producer.log

Similarly in the consumer logs the message values are printed in plaintext.

tail -f java-consumer.log


The Kafka broker is running in a Docker container. Using the docker exec command we can open a bash terminal directly inside it. From here (using the Kafka console consumer) messages can be consumed without going through the Kafka proxy. Therefore nothing is modified in transit, and we can clearly see that the message values are encrypted.

docker exec -it broker_kafka_1 bash
/opt/kafka_2.12-2.3.0/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic villager --group broker-consumer --property print.key=true --property key.separator="-"

Finally, taking a glimpse into the proxy logs, we can see Kafka requests and responses flowing by. In the screenshot below we see a Produce request in transit.

  • A Kafka request with an API key of 0, i.e. a Produce request
  • The version of the Produce request is 7
  • The client is connected from the ephemeral port 52349 and this is the 4th request sent by the client, i.e. the correlation ID is 4
  • The destination topic is villager with a key donet* and value of Gloin

*Note: typo in the producer client, the key should read dotnet

This is the request payload as it arrives at the proxy, before it is encrypted. Afterwards a new Kafka Produce request is created with an encrypted value and forwarded to the Kafka broker.

tail -f Kafka-proxy.out


Hopefully we have gone some way towards answering the second question from the start of this article, “Service meshes are for HTTP traffic, right? Where does Kafka fit in?”. As for the first question, “Would a service mesh make my life easier or harder?”, here are a few of the arguments I have come across.

I’ve seen complex service mesh iptables routing diagrams shown as a reason against using service meshes, but Kubernetes itself already maintains iptables rules to track the pods associated with a service, and we’re happy to use them.

I’ve seen diagrams similar to the control plane diagram above with the control plane replaced by the “out of fashion” SOA. Here I would argue the difference is that an SOA ESB (Enterprise Service Bus) combines business logic with integration: it’s a “smart pipe”. This is definitely not the goal of service meshes. Your business logic should remain in the application containers. This is also one of Kafka’s selling points: Kafka is a “dumb pipe” with no business logic. Kafka doesn’t try to do everything, but what it does, it does really well. This should also be the case for service meshes.

Performance concerns have also been cited, but Banzai Cloud [14] showed that, on their setup, running Kafka on Istio is actually a performance help rather than a hindrance when it comes to TLS traffic.

Regardless of these arguments and the potential benefits, a service mesh does add complexity to your production environment. Any service mesh would need bulletproof reliability, and with Kafka we are still some way from that. Your application architecture also has to benefit from it, which is less likely if it is a monolith. All in all, I think Thoughtworks’ [15] position, “Trial”, makes the most sense. By kicking the tires of Istio, Consul or something similar, in the worst case you learn more about Kubernetes and distributed systems, both of which will be with us for a while.

Some References

[1] https://buoyant.io/2017/04/25/whats-a-service-mesh-and-why-do-i-need-one/

[2] https://www.slideshare.net/gwenshap/gluecon-kafka-and-the-service-mesh

[3] https://www.infoq.com/presentations/lyft-service-mesh/?utm_source=youtube&utm_medium=link&utm_campaign=qcontalks#downloadPdf

[4] https://venilnoronha.io/hand-crafting-a-sidecar-proxy-and-demystifying-istio

[5] https://kubernetes.io/docs/concepts/workloads/pods/init-containers/

[6] https://sookocheff.com/post/kubernetes/understanding-kubernetes-networking-model/

[7] https://kafka.apache.org/protocol#protocol_api_keys

[9] https://www.confluent.io/kafka-summit-lon19/handling-gdpr-apache-kafka-comply-freaking-out/

[10] https://github.com/travisjeffery/kafka-proxy

[11] https://banzaicloud.com/blog/kafka-envoy-protocol-filter/

[12] https://github.com/envoyproxy/envoy/issues/2852


[14] https://banzaicloud.com/blog/kafka-on-istio-performance/

[15] https://www.thoughtworks.com/radar/techniques/service-mesh