Amazon MSK PrivateLink solution for Kafka clients with SASL/IAM & SASL/SCRAM authentication

Laurent Allegre
Published in Airwalk Reply
Jun 7, 2022 · 9 min read

Expanding AWS infrastructure projects with first-class Open-Source Software.

Photo by Marek Piwnicki on Unsplash

Organisations running an Apache Kafka infrastructure on their own servers (physical or cloud) may have already taken the plunge, or at least weighed the pros and cons of migrating their infrastructure to Amazon Managed Streaming for Apache Kafka (MSK).

Some firms might find the MSK benefits of existing migration tools, integration with other AWS services, in-place version upgrades, and ongoing maintenance of the Apache Kafka Clusters a very appealing proposition. Amazon MSK might sound even more attractive if Apache Kafka is a critical component of your business infrastructure and you operate in a regulated environment with strict security controls.

In all fairness, the Apache Kafka development community has done a tremendous job adding a large number of features to increase security in a Kafka Cluster, enabling the product to meet many organisations' requirements.

Apache Kafka security features include:

  • Authentication using either SSL* or SASL.
    Supported SASL mechanisms include SASL/GSSAPI, SASL/PLAIN, SASL/SCRAM-SHA-256/512 and SASL/OAUTHBEARER (from version 2.0).
  • Encryption of data in transit between Kafka clients and Brokers, or between Brokers, can be enabled.
  • Pluggable authorisation with support for external authorisation services.

* Apache Kafka allows clients to use SSL for encryption of traffic as well as authentication.

As we'd expect, the above security features are available through configuration when provisioning the Amazon MSK service, and it's worth noting that choosing SASL authentication does not mean we must sacrifice encryption. In most environments, TLS encryption between Brokers, and between Brokers and Kafka clients, should be enabled by default. On MSK, Amazon manages the Brokers' and Zookeepers' TLS certificates, issued from a trusted CA, which considerably simplifies the management and renewal of the encryption certificates associated with our Kafka Cluster's data transfers.

However, from a security perspective, the key feature that Amazon MSK makes available to its customers is the option to use IAM access control (SASL/IAM) for MSK, enabling the creation of IAM policies and roles tailored to the least-privilege permissions required by producers, consumers, and the organisation's Kafka users. This means you can treat all your Kafka infrastructure, user access and application permissions like the rest of your AWS resources, easily keeping track of your entire 'Kafka deployment' within well-known AWS resources deployable with Infrastructure as Code tools such as Terraform.
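As an illustration (the account ID, cluster and topic names below are hypothetical, not the policies this project defines later), a least-privilege policy for a producer writing to a single topic could look something like this:

# Illustrative only: a producer allowed to connect to the cluster and write to one topic.
resource "aws_iam_policy" "orders_producer" {
  name = "msk-orders-producer"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["kafka-cluster:Connect"]
        Resource = "arn:aws:kafka:eu-west-2:111122223333:cluster/demo-msk/*"
      },
      {
        Effect   = "Allow"
        Action   = ["kafka-cluster:WriteData", "kafka-cluster:DescribeTopic"]
        Resource = "arn:aws:kafka:eu-west-2:111122223333:topic/demo-msk/*/orders"
      }
    ]
  })
}

A consumer would instead receive kafka-cluster:ReadData on its topics and kafka-cluster:DescribeGroup/AlterGroup on its consumer groups, and nothing more.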

Companies and teams who have been running their own Apache Kafka Clusters for months or years, and who have had to develop their own security mechanisms (ACLs), may feel uncertain about switching to MSK and wasting the investment made in tailoring their secure Kafka environments to their specific business cases. However, for this scenario, it is perfectly possible to provision Amazon MSK with both SASL/SCRAM and SASL/IAM enabled simultaneously. This gives teams the option to retain their existing native Kafka authentication mechanism until everything is in place to migrate to, or deploy, SASL/IAM MSK authentication for new Kafka users, consumers, and producers.

Security requirements

In this write-up, we will explore a solution for provisioning an MSK deployment pattern suited to secure environments.

The Kafka clients (producers and consumers) must be deployed in a separate VPC (or AWS account) and connect securely to MSK through AWS PrivateLink. Authentication must support both SASL/SCRAM and SASL/IAM to provide a granular set of least-privilege permissions to consumers and producers.

To further secure the infrastructure, another mandated requirement is to enforce end-to-end TLS encryption between clients and servers whether provisioned in the same or in separate VPCs. This is, of course, supported (and recommended, despite the performance impact).

There are good reasons why using PrivateLink, with a segregated and layered approach to building secure infrastructure, has become a popular pattern in secure environments, as opposed to the more 'trusting' architecture provisioned using VPC peering. Running services in their own VPCs and exposing them only through a PrivateLink allows us to tightly control which internal clients are allowed to connect to a specific service, via the VPC endpoint's Security Group. Also, from the service provider VPC side (MSK), we can control which AWS principals are allowed access to the service.

Furthermore, the Kafka clients’ Security Groups restrict which hosts have access to the MSK Cluster through the VPC endpoint.
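As a simplified sketch of these two controls (resource and variable names are illustrative, and the NLB is assumed to be defined elsewhere):

# Service provider (MSK) VPC: expose the NLB via PrivateLink and whitelist consumer principals.
resource "aws_vpc_endpoint_service" "msk" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.msk_nlb.arn]               # NLB fronting the MSK service
  allowed_principals         = ["arn:aws:iam::111122223333:root"] # Kafka clients account
}

# Consumer (Kafka clients) VPC: an interface endpoint locked down by its own Security Group.
resource "aws_security_group" "kafka_vpce" {
  name   = "kafka-vpce-sg"
  vpc_id = var.client_vpc_id

  ingress {
    description     = "Kafka and Zookeeper TLS ports, from approved client hosts only"
    from_port       = 2182
    to_port         = 9098
    protocol        = "tcp"
    security_groups = [var.kafka_client_sg_id]
  }
}

resource "aws_vpc_endpoint" "msk" {
  vpc_id             = var.client_vpc_id
  vpc_endpoint_type  = "Interface"
  service_name       = aws_vpc_endpoint_service.msk.service_name
  subnet_ids         = var.client_subnet_ids
  security_group_ids = [aws_security_group.kafka_vpce.id]
}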

The problem…

Firstly, in our layout, the client VPC connects through PrivateLink to the VPC endpoint service, and this involves an NLB acting as a TCP pass-through load balancer fronting the Kafka Brokers.

Our requirements include end-to-end TLS encryption, but the NLB is unaware of this TLS traffic and will attempt to connect to any available target (e.g. the MSK Broker ENIs) in each availability zone. Connections will fail when a Broker receives a TLS client request meant for another Broker, as host name verification rejects the mismatched certificate. Put simply, the PrivateLink NLB unwittingly creates havoc in a sophisticated distributed system that communicates via a high-performance TCP protocol with its own built-in fault tolerance.

A Kafka client will typically need to maintain connections to multiple Brokers, as data is partitioned and clients need to talk to the Broker that hosts their partitions.

Secondly, the NLB target groups need to be attached to the Brokers' and Zookeepers' IP addresses. This requires a bit of planning when provisioning the solution with Terraform, as we are only able to resolve the MSK IP addresses once the Cluster has been created. From a security perspective, it is also difficult to create the required MSK ingress rules, as no ingress Security Group referencing is currently possible when an NLB fronts the MSK Security Group. When the NLB directly fronts the MSK ENIs, our code needs to use the NLB IP addresses as the ingress source, adding some complexity to the solution.

PrivateLink design options…

Let's take a look at some possible PrivateLink options to solve our 'TLS connectivity problem' through the NLB.

To be clear, we will not consider any option that compromises on our chosen authentication methods (both SASL/IAM and SASL/SCRAM), and we will not compromise on Kafka security either! TLS encryption between all clients and servers is mandatory in our environment, as is host name verification (enabled by default as of Kafka 2.0) to prevent man-in-the-middle attacks.

Option 1
Our first option would involve creating several cross-VPC PrivateLinks to match the number of MSK Brokers and Zookeepers, and using Route53 to resolve each MSK host through an individual endpoint in the Kafka client VPC. A detailed explanation is available in the first pattern described in this AWS blog, where SSL authentication (mTLS) was implemented.

Kafka Client resolves the Broker DNS to unique VPCe IP addresses

Assuming we have 3 Brokers and 3 Zookeepers spread across availability zones, this would involve provisioning 6 sets of VPC endpoints, VPC endpoint services, Security Groups, and NLBs. This implementation would be quite wasteful in terms of resources and costs and, furthermore, would not scale easily should our infrastructure need to increase or scale down on demand.

Option 2
This option introduces into our design a tried, tested and hardened-by-default piece of Open-Source Software (OSS) that I have been using since 2007 and have deployed across many infrastructure projects! A reverse proxy with TLS support, and one of the best OSS projects I have come across, is HAProxy (hats off to Willy Tarreau and all the contributors who have developed this excellent product!).

HAProxy is a free, very fast and reliable reverse-proxy offering high availability, load balancing, and proxying for TCP and HTTP-based applications.

In this enhanced design we only require one PrivateLink with one NLB, sending traffic to an Auto Scaling group of EC2 instances running HAProxy in TCP mode in each availability zone.

Kafka clients access all MSK hosts through a single VPCe/NLB, with IAM/SCRAM auth.

HAProxy, in our solution, is also configured as a TCP pass-through proxy, with a frontend that retrieves the Server Name Indication (SNI) from the TLS handshake, enabling ACL rules to send traffic to the correct backend (Brokers/Zookeepers).

Here is a useful HAProxy blog describing the configuration in more detail.
Another advantage of using HAProxy is its ability to configure an internal DNS resolver, removing the need to run an MSK post-install Terraform sub-module to resolve the Brokers'/Zookeepers' IP addresses (ENIs).
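To illustrate the idea, here is a minimal hand-written fragment of the kind of configuration we are aiming for (the broker host names are made up; our Terraform template will generate the real entries later):

# TCP pass-through: inspect the TLS ClientHello, match the SNI, and forward the stream unchanged.
frontend kafka_iam
    bind *:9098
    mode tcp
    tcp-request inspect-delay 5s
    tcp-request content accept if { req_ssl_hello_type 1 }
    use_backend broker_1 if { req_ssl_sni -i b-1.demo-msk.abc123.c2.kafka.eu-west-2.amazonaws.com }
    use_backend broker_2 if { req_ssl_sni -i b-2.demo-msk.abc123.c2.kafka.eu-west-2.amazonaws.com }

# Let HAProxy resolve the Broker ENI addresses itself, via the VPC's DNS resolver.
resolvers vpcdns
    nameserver dns1 169.254.169.253:53
    hold valid 10s

backend broker_1
    mode tcp
    server b1 b-1.demo-msk.abc123.c2.kafka.eu-west-2.amazonaws.com:9098 check resolvers vpcdns

TLS is never terminated on the proxy, so our end-to-end encryption and host name verification requirements remain intact.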

This will be my preferred path, as it’s often best to start implementing infrastructure projects with as much simplicity as possible 😃.

Terraforming our MSK PrivateLink HAProxy Solution

We will break down and structure our Terraform code in separate modules under version control, for maintainability and re-usability across projects.

The MSK project parent module will call Terraform submodules to:

  • Create the VPC hosting the MSK ENIs, the VPC endpoint service, the NLB and the HAProxy Auto Scaling group
  • Create the VPC for the Kafka clients
  • Create the VPC endpoints and associated IAM endpoint policies for S3, SessionManager, Logs, SNS, Monitoring, SecretsManager, KMS and STS
  • Create the MSK Cluster IAM authorisation policies and roles for your organisation according to common use cases
  • Create the NLB, NLB listeners and Target Groups for ports 2182, 9096, 9098
  • Create the MSK VPC endpoint service, and allowed Principal in the MSK VPC
  • Create the Kafka Clients MSK VPC endpoint that will connect to the MSK service
  • Create the Kafka Clients Route53 DNS records to send traffic for the Brokers and Zookeepers to the MSK VPC endpoint (PrivateLink); a DNS sketch follows this list.
  • If needed for migration, create the SCRAM authentication secrets in SecretsManager, then run the secret association resource to bind the secrets to your cluster.
  • Create the CloudWatch log group, S3 bucket and/or Kinesis Firehose for your MSK logging requirements
  • Create the HAProxy instances managed by an Auto Scaling Group
  • Create the Amazon MSK Cluster service
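Here is a rough sketch of the Route53 piece mentioned above. The zone name is illustrative (in practice it is derived from the Cluster's connection strings) and the endpoint reference assumes the interface endpoint created for the PrivateLink:

# Private hosted zone covering the MSK Cluster's DNS suffix, attached to the Kafka clients VPC.
resource "aws_route53_zone" "msk_private" {
  name = "demo-msk.abc123.c2.kafka.eu-west-2.amazonaws.com"

  vpc {
    vpc_id = var.client_vpc_id
  }
}

# Point every Broker host name at the PrivateLink interface endpoint.
# Equivalent records are created for the z-1 to z-3 Zookeeper host names.
resource "aws_route53_record" "brokers" {
  count   = 3
  zone_id = aws_route53_zone.msk_private.zone_id
  name    = "b-${count.index + 1}"
  type    = "CNAME"
  ttl     = 60
  records = [aws_vpc_endpoint.msk.dns_entry[0].dns_name]
}

With these records in place, the Kafka clients keep using the Brokers' advertised host names, but all traffic lands on the VPC endpoint.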

Let’s now take a look at some of the elements needed to implement this solution.

1. MSK Cluster
We will use the following code in a child module (called msk_cluster) to provision the MSK Cluster resources:
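A condensed sketch of this module (the Kafka version, instance sizing and variable names are illustrative):

# Customer-managed key for encrypting the MSK Broker volumes at rest.
resource "aws_kms_key" "msk" {
  description = "MSK volume encryption key"
}

resource "aws_msk_cluster" "this" {
  cluster_name           = var.cluster_name
  kafka_version          = "2.8.1"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"
    client_subnets  = var.msk_subnet_ids
    security_groups = [var.msk_security_group_id]
    ebs_volume_size = 500
  }

  encryption_info {
    encryption_at_rest_kms_key_arn = aws_kms_key.msk.arn

    encryption_in_transit {
      client_broker = "TLS" # default value, declared for readability
      in_cluster    = true  # default value, declared for readability
    }
  }

  # Both mechanisms enabled, to support migration from native SCRAM auth to IAM access control.
  client_authentication {
    sasl {
      iam   = true
      scram = true
    }
  }

  logging_info {
    broker_logs {
      cloudwatch_logs {
        enabled   = true
        log_group = var.msk_log_group
      }
    }
  }
}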

In the above code snippet, a dedicated customer-managed KMS key is created to encrypt the MSK volumes at rest; had we not provided one, an AWS-managed KMS key would have been used automatically. Encryption of data in transit between clients and Brokers, and between Broker nodes, is enabled by default and is declared in the code only for readability. MSK SASL client authentication is enabled for both IAM and SCRAM, to allow migration from native Kafka authentication to IAM MSK access policies.

In a separate file we will declare the Cluster outputs required by the other modules:

msk_cluster module outputs
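Outputs along these lines expose the connection strings and ARN the other modules need:

output "bootstrap_brokers_sasl_iam" {
  description = "SASL/IAM bootstrap connection string (host:9098 pairs)"
  value       = aws_msk_cluster.this.bootstrap_brokers_sasl_iam
}

output "bootstrap_brokers_sasl_scram" {
  description = "SASL/SCRAM bootstrap connection string (host:9096 pairs)"
  value       = aws_msk_cluster.this.bootstrap_brokers_sasl_scram
}

output "zookeeper_connect_string_tls" {
  description = "TLS Zookeeper connection string (host:2182 pairs)"
  value       = aws_msk_cluster.this.zookeeper_connect_string_tls
}

output "cluster_arn" {
  value = aws_msk_cluster.this.arn
}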

2. HAProxy instances
We will create another sub-module dedicated to the HAProxy resources, with IAM policies/role, a Launch Template, an Auto Scaling group, and a Security Group.

This module needs to be invoked after the above MSK Cluster module has run, as we will use the cluster connection strings (e.g. bootstrap_brokers_sasl_iam & zookeeper_connect_string_tls) to derive the MSK DNS names HAProxy will connect to.
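A trimmed sketch of the sub-module's core resources (the AMI variable, sizing and names are illustrative; the Security Group and instance profile are defined elsewhere in the same module):

locals {
  # Derive the Broker and Zookeeper host names from the msk_cluster outputs,
  # stripping the port suffixes from the connection strings.
  broker_hosts    = [for b in split(",", module.msk_cluster.bootstrap_brokers_sasl_iam) : split(":", b)[0]]
  zookeeper_hosts = [for z in split(",", module.msk_cluster.zookeeper_connect_string_tls) : split(":", z)[0]]
}

resource "aws_launch_template" "haproxy" {
  name_prefix            = "msk-haproxy-"
  image_id               = var.haproxy_ami_id # customised Amazon Linux 2 AMI with haproxy2
  instance_type          = "t3.small"
  vpc_security_group_ids = [aws_security_group.haproxy.id]

  iam_instance_profile {
    name = aws_iam_instance_profile.haproxy.name
  }

  # Render the HAProxy bootstrap script with the MSK host names.
  user_data = base64encode(templatefile("${path.module}/templates/user_data.tpl", {
    broker_hosts    = local.broker_hosts
    zookeeper_hosts = local.zookeeper_hosts
  }))
}

resource "aws_autoscaling_group" "haproxy" {
  name                = "msk-haproxy"
  min_size            = 3
  max_size            = 3
  desired_capacity    = 3
  vpc_zone_identifier = var.haproxy_subnet_ids # one subnet per availability zone
  target_group_arns   = var.nlb_target_group_arns

  launch_template {
    id      = aws_launch_template.haproxy.id
    version = "$Latest"
  }
}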

In the above code, we create a Launch Template for the Auto Scaling Group that passes the MSK Broker and Zookeeper DNS host names to the HAProxy instances' `user_data` script. The image_id of the Launch Template refers to a customised Amazon Linux 2 AMI with HAProxy, installed with:

# Enable the HAProxy 2.x topic from Amazon Linux Extras, then install HAProxy and a few helper tools
amazon-linux-extras enable haproxy2
yum clean metadata
yum install -y haproxy2 socat jq nc

3. HAProxy configuration
The Launch Template's user_data template will generate our HAProxy configuration from our Cluster's Zookeeper and Broker host names:

“${path.module}/templates/user_data.tpl”
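A heavily condensed sketch of what this template might contain; the broker_hosts and zookeeper_hosts variables come from the templatefile() call shown earlier, the global/defaults/resolvers sections match the fragment shown previously, and the frontends for ports 9096 and 2182 follow the same pattern:

#!/bin/bash
# Write an haproxy.cfg that routes on the TLS SNI, then start HAProxy.
cat > /etc/haproxy/haproxy.cfg <<'CFG'
frontend kafka_iam
    bind *:9098
    mode tcp
    tcp-request inspect-delay 5s
    tcp-request content accept if { req_ssl_hello_type 1 }
%{ for host in broker_hosts ~}
    use_backend be_${host} if { req_ssl_sni -i ${host} }
%{ endfor ~}

%{ for host in broker_hosts ~}
backend be_${host}
    mode tcp
    server ${host} ${host}:9098 check
%{ endfor ~}
CFG

systemctl enable haproxy
systemctl restart haproxy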

The user_data installs the HAProxy configuration on launch, and the frontends are configured to send clients' requests to the correct MSK backend server based on a name match against the TLS SNI extension.

For convenience, a haproxy.stat script is also added to the instance to help quickly visualise HAProxy stats from the command line, although configuring the HAProxy Stats page will also do…and will often prove most useful 😃.

Conclusion
Secure infrastructure designs will occasionally define requirements that may be challenging to implement solely with native Cloud services.

Fortunately, there is a wide range of Open-Source Software available to engineers that can be easily integrated into your Cloud infrastructure code.
HAProxy was an obvious choice for implementing MSK with PrivateLink because I was aware of its ability to extract the TLS SNI extension, and because its robustness and high configurability as a layer 4 or layer 7 reverse proxy make it well suited to building secure platforms.
