Kubernetes, Strimzi, Amazon MSK and Kafka-Proxy: A recipe for automation

Nabil Andaloussi · Data rocks · Sep 28, 2020

Goals

  • Abstract database permissions, authentication and configuration requirements away from developers and service owners in a Microservices environment.
  • Allow a single and simple method for all Kafka clients (from any programming language or driver) to securely authenticate and communicate with Amazon MSK.
  • Define and create Kafka User and Topic resources from Kubernetes custom resource definitions.

Data Operations and Microservices

To give some context and background on the subjects I’ll be touching on in this post, it’s important to step back and look at some of the technical goals you, as a Platform / Dev / Sec / Data / Wiz-Bang Ops Engineer, will want to achieve with any data-related project you take on.

Most people by now are very familiar with the technical principles of DevOps: version control, continuous integration, automated deployments and so on. The goal in our realm of data and data platform operations is… exactly the same? Well yes, why not? The industry standard for many organisations is now to define everything as code, or as some sort of desired state, to eliminate the need to manually create and fix things again and again, for example when releasing your product or service in a new cloud region.

Most organisations already do that with the likes of Terraform, Kubernetes, Helm, Ansible and Docker. When you need to spin up your product or service in a new region, you change your environment variables and deploy. Job done, pack your bags, it’s home time. For most engineers, this is the easiest part of the job.

What is less easy, I would argue, is bringing that same paradigm into a Microservices environment, where you need to assign not only resources, Docker images, volumes and environment variables to services, but also manage permissions, security and access to your databases and data resources for potentially thousands of services.

In this post we’ll be doing just that: a Microservices application will be defined in a couple of lines of YAML, and in the background we’ll introduce some tech to automate access and topic management in our Kafka clusters, specifically Amazon MSK.

Architecture

Let’s begin by introducing some technologies we’ll be using.

Amazon MSK

Amazon Managed Streaming for Apache Kafka is a fully managed service that makes it super easy to spin up production-ready Kafka clusters. We recently migrated from our own manual implementation of Kafka clusters to MSK, which has certainly benefited us: we worry far less about the underlying EC2 instances, availability zones, scaling and so on.

For more information, you can read the official docs here.

It’s important to note that, at the time of writing, Amazon MSK only supports TLS certificate-based authentication for clients, which is what this article is based on.
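If your cluster already exists, you can grab its TLS bootstrap broker string with the AWS CLI; look for the BootstrapBrokerStringTls field in the response. You’ll need these brokers later when configuring kafka-proxy:

# Retrieve the bootstrap brokers for an MSK cluster (the TLS listeners are on port 9093)
aws kafka get-bootstrap-brokers --cluster-arn <your-cluster-arn>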

Amazon PCA

Amazon Private Certificate Authority is a managed, highly available Private CA that allows administrators to create a complete online and offline CA hierarchy from the root through to subordinate CAs with no need for an external CA.

Do note, Amazon PCA is a necessary requirement for creating an MSK cluster with TLS client authentication enabled.
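If you don’t have a CA yet, a private root CA can be created with the AWS CLI along these lines (a minimal sketch; the subject fields mirror the CERT_SUBJECT used later in this post). Note the new CA still needs its own certificate installed before it can issue anything; the PCA docs cover that flow.

# Sketch: create a private root CA in ACM PCA (region and subject values are placeholders)
aws acm-pca create-certificate-authority \
  --region eu-west-2 \
  --certificate-authority-type ROOT \
  --certificate-authority-configuration '{
    "KeyAlgorithm": "RSA_2048",
    "SigningAlgorithm": "SHA256WITHRSA",
    "Subject": {"Country": "GB", "State": "London", "Organization": "MyOrg", "CommonName": "my-root-ca"}
  }'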

For more information, see the docs here.

Strimzi Kafka Operator

The Strimzi project follows the Kubernetes Operator pattern (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/). It provides an easy way to run an Apache Kafka cluster on Kubernetes, as well as some key operator applications that aid in deploying Kafka-related components (such as Kafka Bridge, Kafka Connect, Kafka MirrorMaker etc.).

It’s important to note that in our case we are only utilising the standalone Strimzi Topic and User Operators to manage Kafka topics and ACLs.

The Strimzi project also comes with a set of Kubernetes custom resource definitions, or CRDs. These define objects, similar to a Pod or a Deployment, that we can use to declare a KafkaUser or KafkaTopic, which the Operators then interpret and manage.

Full instruction details can be found in the official documentation.

Strimzi Overview

As abstracted from the official documentation, the role of the Topic Operator is to keep a set of KafkaTopic Kubernetes custom resources in sync with the corresponding Kafka topics in the cluster. This allows us to declare a KafkaTopic resource as part of a service or application’s deployment configuration; the Topic Operator will then create, update or delete the topic(s) as necessary.

apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
  labels:
    strimzi.io/cluster: dev-msk
  name: some-topic
  namespace: dev
spec:
  config:
    retention.ms: 604800000
    segment.bytes: 1073741824
  partitions: 1
  replicas: 3

In the same way, a service or application defines a KafkaUser resource, which instructs the User Operator to create, delete or update a user and its ACLs.

apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaUser
metadata:
  labels:
    strimzi.io/cluster: dev-msk
  name: kafka.user.kafka-msk-toolbox
  namespace: dev
spec:
  authentication:
    type: tls
  authorization:
    acls:
      - host: '*'
        operation: Read
        resource:
          name: '*'
          patternType: literal
          type: topic
        type: allow
      - host: '*'
        operation: Write
        resource:
          name: '*'
          patternType: literal
          type: topic
        type: allow
      - host: '*'
        operation: Describe
        resource:
          name: '*'
          patternType: literal
          type: topic
        type: allow
    type: simple

This in turn will create a KafkaUser object as well as a Kubernetes secret of the same name.
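As a quick illustration, you can list the keys the User Operator stores in that secret (names taken from the example above; the key list matches what the proxy sidecar mounts later in this post):

# Inspect the generated secret; expect ca.crt, user.crt, user.key, user.p12 and user.password
kubectl -n dev get secret kafka.user.kafka-msk-toolbox -o json | jq '.data | keys'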

As a side note, the Strimzi User Operator will act as a subordinate certificate authority in our architecture, which we’ll go into in more detail later.

For more information on all things Strimzi, please see their docs here:

https://github.com/strimzi/strimzi-kafka-operator

https://strimzi.io/

Kafka Proxy

Kafka-proxy does what it says on the tin: it’s a small service that connects to your Kafka brokers so that clients don’t need to deal with SASL/PLAIN authentication and SSL certificates themselves. This is particularly important in Microservices environments, where clients can be written in any programming language and use any Kafka client driver.

We are using the kafka-proxy tool as a sidecar in our model, so clients need only connect to localhost with no authentication. The sidecar itself is configured with the certificates and credentials generated by the Strimzi User Operator.

Shipcat (optional)

Shipcat is an open-source, Rust-based automation tool that abstracts away the complexities associated with Kubernetes and related external tools such as Vault, Kong and now Kafka.

It works in conjunction with kubectl and helm to deploy limitless Microservices for an organisation. In essence, it renders simplified YAML objects into more complex and complete Kubernetes application deployments (an example manifest is shown later in this post).

For more information on the shipcat project, please see the official docs here:

https://github.com/babylonhealth/shipcat

As an FYI, Shipcat is heavily used at Babylon Health for a lot of our CI/CD processes. As an example, our CI/CD pipelines render and apply our data objects (in this case, Kafka resources) in distinct stages.

Alright, let’s cook

In this section, I’m going to detail some of the necessary steps to set up the PCA and the User Operator.

What you’ll need:

  • A kubernetes cluster, helm, kubectl
  • Amazon MSK Cluster (ensure you allow only TLS authentication when installing)
  • An Amazon PCA instance

Remember, the User Operator will act as our subordinate CA inside our Kubernetes cluster. This means it will have permission to issue certificates for child entities (users) on behalf of our private CA. In this step, we’ll create the User Operator’s certificates.

#!/bin/bash
CA_ARN=XXXX # ARN of your Certificate Authority
CA_REGION=eu-west-2 # The region your CA is installed in
CERT_SUBJECT="/C=GB/ST=London/O=MyOrg/CN=user-operator" # ensure everything but the CN is the same as your PCA

# Create the private key and certificate signing request
openssl req -new -newkey rsa:2048 -days 365 -subj ${CERT_SUBJECT} -keyout ca.key -nodes -out ca-sign-request.csr

# Ensure you are logged into AWS. This command will issue and sign the user operator certificate
CERT_ARN=$(aws --region ${CA_REGION} acm-pca issue-certificate \
  --certificate-authority-arn ${CA_ARN} \
  --csr $(cat ca-sign-request.csr | base64) \
  --signing-algorithm "SHA256WITHRSA" \
  --template-arn arn:aws:acm-pca:::template/SubordinateCACertificate_PathLen0/V1 \
  --validity Value=300,Type="DAYS" | jq -r '[.CertificateArn] | @tsv')

# This command will retrieve the signed certificate along with the root certificate
aws acm-pca get-certificate \
  --region ${CA_REGION} \
  --certificate-authority-arn ${CA_ARN} \
  --certificate-arn ${CERT_ARN} | jq -r '.[]' > ca.crt

** It’s important to note here that there are different types of certificates you can create with Amazon PCA. By default, you will only be issued child (end-entity) certificates. In our case, we want a certificate that can itself issue certificates for our Kafka users, which is why we’ve added --template-arn arn:aws:acm-pca:::template/SubordinateCACertificate_PathLen0/V1 above.

For more information on the different types of certificates you can issue with PCA, click the link here.

You should now be left with two files, ca.crt and ca.key. Keep these safe, as they’ll be used during the User Operator installation.
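If you’re keeping them in Vault (as the demo charts assume), something along these lines works, where the secret path is hypothetical:

# Store the subordinate CA material in Vault for the chart to pick up
vault kv put secret/kafka-user-operator ca.crt=@ca.crt ca.key=@ca.key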

Next, we will verify the certificates to ensure they serve the purpose we intend:

> openssl x509 -purpose -in ca.crt -inform PEM
Certificate purposes:
SSL client : Yes
SSL client CA : Yes
SSL server : Yes
SSL server CA : Yes
Netscape SSL server : No
Netscape SSL server CA : Yes
S/MIME signing : Yes
S/MIME signing CA : Yes
S/MIME encryption : No
S/MIME encryption CA : Yes
CRL signing : Yes
CRL signing CA : Yes
Any Purpose : Yes
Any Purpose CA : Yes
OCSP helper : Yes
OCSP helper CA : Yes
Time Stamp signing : No
Time Stamp signing CA : Yes
-----BEGIN CERTIFICATE-----
xxxxxxxxxx
xxxxxxxxxx
-----END CERTIFICATE-----

Strimzi User Operator installation

If you’re not already familiar with the internals of Strimzi, I would certainly recommend going through the docs above at this point. For simplicity, I’ve extracted the components we need and bundled them into Helm projects.

Head over to https://github.com/nhjiejan/MSK-Strimzi-Demo/tree/master/kafka-user-operator and clone the project. It assumes you’re using Vault and Secret-manager to store and retrieve your secrets; if not, replace the secret section of the chart with your preferred method (a plain secret is sketched below). Either way, ensure the secrets we created above are referenced in the chart’s values.
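For instance, a plain Kubernetes secret would look something like this (the secret name is hypothetical; align it with whatever your values.yaml references):

# Create the CA secret directly instead of pulling it from Vault
kubectl -n <your-namespace> create secret generic kafka-user-operator-ca \
  --from-file=ca.crt --from-file=ca.key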

Edit the values.yaml file to tailor the installation to your environment and install the User Operator chart:

helm install --name kafka-user-operator ./kafka-user-operator -f values.yaml

Once the User Operator is up and running, check the logs from the pod to ensure all is in order.
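A sketch, assuming the chart creates a Deployment named after the release:

# Tail the User Operator logs and watch for reconciliation errors
kubectl logs -f deploy/kafka-user-operator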

Topic Operator install

The Topic Operator itself will also use kafka-proxy in our setup. Inside the kafka-topic-operator folder of the GitHub project above is a template with a KafkaUser for the Topic Operator; the chart will automatically load the secrets from this user into the kafka-proxy container.

Again, edit the values.yaml to tailor the install to your environment and apply, for example:
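A sketch mirroring the User Operator install, assuming the chart folder name from the demo repo:

helm install --name kafka-topic-operator ./kafka-topic-operator -f values.yaml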

After a few minutes you should have a working installation of the User Operator and the Topic Operator, as well as a KafkaUser object for the kafka-topic-operator service:

kubectl get ku 

Next, verify the user certificate created by the User Operator. Grab a copy of the ca.crt we created earlier; this file should contain both the root cert and the subordinate CA cert (the User Operator’s public cert).

kubectl get secret kafka.user.kafka-topic-operator-dev-msk -o jsonpath='{.data.user\.crt}' | base64 -D > user.crt
❯ openssl verify -CAfile ca.crt user.crt
user.crt: OK

Sample application install

Head into the kafka-simple-client folder in the GitHub project above. Build the image from the Dockerfile and push it to your local image repository.

As mentioned above, we’re using Shipcat to abstract and render all of our Kubernetes objects. Our CI/CD pipeline interprets our Microservice manifest file and converts it into a Helm chart. It’s important to note that it is at this stage we make use of a set of _helpers.yaml templates in our shipcat manifests to inject the necessary sidecars and init containers (i.e. kafka-proxy). This takes place for the hundreds of services we run.

name: kafka-simple-client
image: quay.io/babylonhealth/kafka-simple-client
version: 0.0.1
resources:
  limits:
    cpu: 1
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 128M
replicaCount: 1
regions:
  - dev-uk
rollingUpdate:
  maxUnavailable: 0%
  maxSurge: 50%
command: ['sleep', 'infinity']
kafkaResources:
  users:
    - name: kafka-simple-client
      acls:
        - resourceName: some-topic
          resourceType: topic
          patternType: literal
          operation: Describe
          host: '*'
        - resourceName: some-topic
          resourceType: topic
          patternType: prefix
          operation: Read
          host: '*'
        - resourceName: some-topic
          resourceType: topic
          patternType: prefix
          operation: Write
          host: '*'
  topics:
    - name: some-topic
metadata:
  contacts:
    - name: "Nabil Andaloussi"
      slack: "@DRKB64WX6"
  team: data-operations
  repo: https://github.com/nhjiejan/MSK-Strimzi-Demo

** Example Microservice manifest.

The rendered version of this file can be found in kafka-simple-client/application.yaml.

Next, update the file with your Docker image and your bootstrap servers, then run a kubectl apply.
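For example (the file path comes from the demo repo; the dev namespace is an assumption):

kubectl -n dev apply -f kafka-simple-client/application.yaml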

Test the App

Start two exec sessions onto the simple-client pod.
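A minimal sketch, assuming kubectl can resolve a pod from the (hypothetical) deployment name used in the manifest above:

# Open an interactive shell in the kafka-simple-client pod.
# Run this in two separate terminals to get two sessions.
kubectl -n dev exec -it deploy/kafka-simple-client -- /bin/bash

On the first session, run the producer: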

mvn exec:java -Dexec.mainClass=\
"io.confluent.examples.clients.cloud.ProducerExample" \
-Dexec.args="/conf/client.conf some-topic"

On the other session, run the consumer:

mvn exec:java -Dexec.mainClass=\
"io.confluent.examples.clients.cloud.ConsumerExample" \
-Dexec.args="/conf/client.conf some-topic"

Kafka-Proxy explained

So what exactly is happening at this stage? In short, kafka-proxy opens a local listener (think of a socat-style tunnel) for every bootstrap-server mapping we specify. As we’re using TLS authentication, we also hand our client certificate, key and key password (as generated by the Strimzi User Operator) to the proxy, which then handles authentication on our behalf.

- name: kafka-msk-proxy-sidecar
  image: quay.io/babylonhealth/kafka-msk-proxy:v0.2.5
  imagePullPolicy: IfNotPresent
  command: ["/usr/local/bin/kafka-proxy"]
  args:
    - 'server'
    - '--log-format=json'
    - '--bootstrap-server-mapping=msk1:9093,127.0.0.1:32401'
    - '--bootstrap-server-mapping=msk2:9093,127.0.0.1:32402'
    - '--bootstrap-server-mapping=msk3:9093,127.0.0.1:32403'
    - '--tls-enable'
    - '--tls-ca-chain-cert-file=/certs/ca.crt'
    - '--tls-client-cert-file=/certs/user-ca-chain.crt'
    - '--tls-client-key-file=/certs/user.key'
    - '--tls-client-key-password=$(TLS_CLIENT_KEY_PASSWORD)'
    - '--tls-insecure-skip-verify'
  env:
    - name: USERNAME
      value: kafka.user.kafka-simple-client
    - name: TLS_CLIENT_KEY_PASSWORD
      valueFrom:
        secretKeyRef:
          name: kafka.user.kafka-simple-client
          key: user.password
  volumeMounts:
    - name: ca-crt
      mountPath: /certs/ca.crt
      subPath: ca.crt
    - name: user-crt
      mountPath: /certs/user.crt
      subPath: user.crt
    - name: user-key
      mountPath: /certs/user.key
      subPath: user.key
    - name: user-ca-chain
      mountPath: /certs/user-ca-chain.crt
      subPath: user-ca-chain.crt
...
volumes:
  - name: ca-crt
    secret:
      secretName: kafka.user.kafka-simple-client
      items:
        - key: ca.crt
          path: ca.crt
  - name: user-crt
    secret:
      secretName: kafka.user.kafka-simple-client
      items:
        - key: user.crt
          path: user.crt
  - name: user-key
    secret:
      secretName: kafka.user.kafka-simple-client
      items:
        - key: user.key
          path: user.key
  - name: user-p12
    secret:
      secretName: kafka.user.kafka-simple-client
      items:
        - key: user.p12
          path: user.p12
  - name: user-ca-chain
    emptyDir: {}

The client configuration is now much smaller and simpler:

security.protocol=PLAINTEXT
bootstrap.servers=127.0.0.1:32401,127.0.0.1:32402,127.0.0.1:32403
ssl.endpoint.identification.algorithm=
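Any plain Kafka client pointed at the local ports will now work. As a quick sanity check with the stock console tools (a sketch; the script name assumes a standard Kafka distribution is available on the pod):

# Produce messages through the proxy; no TLS material needed client-side
kafka-console-producer.sh \
  --broker-list 127.0.0.1:32401 \
  --topic some-topic \
  --producer.config /conf/client.conf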

Conclusion

While a lot of organisations gear up to be more cloud or container native (K8s), it made sense for us at Babylon Health to move as many of the key objects and resources we associate with a service deployment into Kubernetes as possible. This of course includes Kafka resources, and thanks to the Strimzi project, that is now possible.

It also means we no longer have separate support tickets and operations to allow devs and services access to Kafka; they are now self-governing while confined to our ruleset. Configuring a JVM or non-JVM Kafka client application to use SSL encryption, SASL users and ACLs is not what our Babylon engineers should be spending their time on. They can now focus on the business domain and trust the automation to deploy their service with the security and governance in place.

Next Steps…

Our next steps involve moving away from the default simple authorizer in our KafkaUser ACL definitions and towards Open Policy Agent, so we can globally define rulesets not just for Kafka permissions but for many other database permission sets as well.

For more information on that, checkout this blog post here.

Special thanks to folks at Strimzi for all their work and support on the project, and to Jeremy Frenay for his work on Kafka-proxy and reviewing this post!
