Working effectively with Amazon Managed Streaming for Apache Kafka (MSK) using IAM auth

By Jszczepankiewicz, published in gft-engineering, 7 min read, May 10, 2023

One messaging system to rule them all

I am a big fan of industry standards, especially if they are open for implementation without paying royalties. If you are a veteran of the Java ecosystem, you might remember Sun Microsystems and how standards around Java were developed. The motto of that era was "Cooperate on standards, compete on implementation": major IT players sat on a committee and agreed on an API, and competitors then shipped multiple implementations of the same standard API. In the messaging area, JMS was fundamental. Later, the financial industry created AMQP. Nothing suggested that a strong new player could emerge outside of open-standard implementations. Yet even RabbitMQ, an innovative AMQP implementation built on the Erlang platform, could not stop the increasing domination of Kafka.

AWS is one of the cloud providers that does not hesitate to offer multiple products to solve the same problem. And so Amazon Managed Streaming for Apache Kafka (MSK) was introduced. In this article we will focus less on the MSK product itself and more on authentication in MSK, specifically on how to avoid some common pitfalls when integrating MSK with AWS-native IAM security.

MSK authentication modes

It is hard to justify missing authentication and authorization in any production-like workload, whether you are building a hobby project or a full-blown e-commerce solution. One of the value-added features of MSK is support for multiple authentication schemes in the same cluster. This is useful when, for example, hosted legacy applications use SCRAM while your newer clients use IAM authentication, all of them subscribing to the same topics through different authentication modes.

We will not cover IAM itself here because there is plenty of good information on the web about how it works. It is enough to say that AWS IAM allows granularly defined access, including cross-account permissions, in a single unified way across the whole of AWS.

Some of the most important benefits of using IAM in MSK over SCRAM / mTLS:

  • no typical secret/certificate rotation procedure is required, because IAM authentication uses short-lived credentials from AWS STS out of the box. Once the access has been created and shaped, you can forget about the regular overhead of secrets rotation
  • no need to introduce Kafka ACLs, since all access can be shaped granularly (per topic and action type) through IAM. For example, you can grant read permission on topic A and write permission on topic B

All looks good, so why bother with authentication schemes other than IAM? They still make sense in at least the following situations:

  • you have non-Java clients that require connectivity to Kafka
  • you cannot accept the quotas that apply specifically to AWS IAM mode. See below for details.

Please also note that if you want to use MSK Connect, you need to use IAM authentication, since no other authentication mode is supported. At the time of writing, SCRAM support in MSK Connect was planned for the end of 2022.

MSK IAM quotas

When you look at the MSK quotas, the following ones are specific to IAM authentication (as published in 2022-06; please note that they might change in the future):

  • there is a soft limit (it can be raised through a support case) of 3000 TCP connections per broker at any given time
  • there is a soft limit of 20 NEW TCP connections per broker per second for all broker types except kafka.t3.small, which allows 4 new connections per second
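
To get a feel for the new-connection quota, a back-of-the-envelope estimate helps. The sketch below is only an illustration: it assumes that all clients reconnect to a single broker at the same moment and that the broker accepts new connections at exactly the quota rate, which is a simplification. The class and method names are hypothetical.

```java
final class ReconnectEstimate {

    /**
     * Lower bound, in seconds, for one broker to accept the given number
     * of new connections at the given new-connection quota.
     */
    static long minSecondsToReconnect(int connections, int newConnectionsPerSecond) {
        if (connections < 0 || newConnectionsPerSecond <= 0) {
            throw new IllegalArgumentException("connections must be >= 0 and rate must be > 0");
        }
        // Ceiling division: a partially used second still counts as a full second.
        return (connections + (long) newConnectionsPerSecond - 1) / newConnectionsPerSecond;
    }
}
```

With the quotas listed above, 600 connections re-established at once need at least 30 seconds on a broker that allows 20 new connections per second, and at least 150 seconds on kafka.t3.small.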

In most workloads these limits are tolerable. Please note that there are also other important quotas not related to authentication, such as the number of partitions per broker, which we will not cover in this article. Read more at:

https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html#partitions-per-broker

Let’s now take a brief look at language/platform support for client connectivity.

Supported programming languages

At the time of writing (2022-06), support for AWS IAM authentication mode is provided only for JVM clients (Java, Kotlin, Scala) through the library https://github.com/aws/aws-msk-iam-auth. It is recommended to use at least version 1.1.3, which contains an important fix that increases fault tolerance when interacting with the AWS STS service.
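
For orientation, the client-side settings that enable IAM authentication with this library can be sketched as below. The property values are the ones documented in the aws-msk-iam-auth README; the class name, the method name and the bootstrap address passed in are hypothetical, and the sketch deliberately uses plain string keys so that it compiles without the Kafka client on the classpath.

```java
import java.util.Properties;

final class MskIamClientConfig {

    /** Builds Kafka client properties for IAM authentication against MSK. */
    static Properties iamProperties(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // IAM authentication runs over TLS, so the protocol is SASL_SSL.
        props.setProperty("security.protocol", "SASL_SSL");
        // SASL mechanism registered by the aws-msk-iam-auth library.
        props.setProperty("sasl.mechanism", "AWS_MSK_IAM");
        // Login module and callback handler shipped with the library.
        props.setProperty("sasl.jaas.config",
                "software.amazon.msk.auth.iam.IAMLoginModule required;");
        props.setProperty("sasl.client.callback.handler.class",
                "software.amazon.msk.auth.iam.IAMClientCallbackHandler");
        return props;
    }
}
```

The resulting Properties object can be passed to a KafkaProducer or KafkaConsumer constructor together with the usual serializer or deserializer settings, provided the aws-msk-iam-auth jar is on the classpath.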

Support for non-JVM languages is under development. Some good hints can be found in https://github.com/aws/aws-msk-iam-auth/issues/10

Troubleshooting your Kafka IAM authentication

Now that we know what to expect from language support and quotas, it is time to focus on enabling connectivity. The rest of this article covers some common pitfalls of using AWS IAM with MSK and how to avoid them. Please note that we will not cover how to shape your IAM access policy; since that is common to your whole AWS ecosystem, we assume it does not need describing here.

Problem: cluster not reachable

One of the common initial pitfalls with MSK and IAM is constructing a client URL with an invalid port. More info at https://docs.aws.amazon.com/msk/latest/developerguide/port-info.html

For security reasons, private network access is recommended, which leaves port 9098 as the port clients should use to connect with IAM authentication.

The most common way to get the client URL is to open the console after the MSK cluster has been created and copy the client connectivity URL into your client code. Interestingly, if you refresh this URL you will find that the order of the brokers changes. That is not a bug; it balances the traffic between all nodes. Please mind that there is a separate URL for each authentication method, so make sure you copy the one for IAM.
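
A copied bootstrap string can be sanity-checked for the wrong port before it ever reaches a Kafka client. The following is a minimal sketch using only the JDK; the class name, the method name and the example host names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BootstrapPortCheck {

    /** Port used for IAM authentication from within the private network. */
    static final int IAM_PRIVATE_PORT = 9098;

    /** Returns the bootstrap entries whose port differs from the expected one. */
    static List<String> entriesWithWrongPort(String bootstrapServers, int expectedPort) {
        List<String> wrong = new ArrayList<>();
        for (String entry : bootstrapServers.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate trailing commas
            }
            int colon = trimmed.lastIndexOf(':');
            // An entry without any port is reported as wrong, too.
            String port = colon < 0 ? "" : trimmed.substring(colon + 1);
            if (!port.equals(Integer.toString(expectedPort))) {
                wrong.add(trimmed);
            }
        }
        return wrong;
    }
}
```

For example, calling entriesWithWrongPort("b-1.example.com:9098,b-2.example.com:9092", BootstrapPortCheck.IAM_PRIVATE_PORT) returns a list containing only "b-2.example.com:9092", which points at a URL copied for a different authentication method.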

If you have the correct URL with the IAM port and still cannot connect, you may suspect a network problem. In that case check the following areas:

  1. Check that there are routes between the client and the MSK interfaces
  2. Confirm that the Security Group inbound rule on MSK allows traffic from the client
  3. Confirm that the Security Group outbound rule on the client allows traffic to MSK
  4. Confirm that no Network Access Control List is blocking this traffic
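
Before working through the list above, a quick TCP probe run from the client host can tell whether the broker port is reachable at all. This is a minimal sketch using only the JDK; the class name is hypothetical, and a successful connect proves only network reachability, not that TLS or IAM authentication will succeed.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class BrokerProbe {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Timeout, refused connection or unresolvable host: all count as unreachable here.
            return false;
        }
    }
}
```

Calling BrokerProbe.canConnect with a broker host name from your bootstrap string, port 9098 and a timeout of a few seconds gives a first yes/no answer. A timeout usually hints at routing, Security Group or NACL issues, which is where the checks above come in.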

For that purpose I recommend running the excellent VPC Reachability Analyzer:

https://docs.aws.amazon.com/vpc/latest/reachability/getting-started.html

Please note that you might first need to find out which ENI is used by Kafka, which you can do by enumerating the Elastic Network Interfaces and checking their allocation status. This is required because the Reachability Analyzer does NOT operate on such high-level objects as an "MSK cluster" and requires you to specify exactly which ENI you want to connect to.

Some other useful information for debugging connectivity further: https://aws.amazon.com/premiumsupport/knowledge-center/msk-cluster-connection-issues/

Problem: org.apache.kafka.common.errors.SaslAuthenticationException: Too many connects

As mentioned before, two important quotas should be taken into account in design and monitoring. The first is the static maximum number of connections to a broker, while the second is the maximum number of NEW connections allowed per second. More context: https://github.com/aws/aws-msk-iam-auth/issues/45

The following mitigation options are available:

Option 1: make your service more resilient by configuring back-off and recovery in your client libraries. More info in https://github.com/aws/aws-msk-iam-auth#failed-authentication-too-many-connects
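
One way to introduce back-off on the client side is through the standard Kafka client settings reconnect.backoff.ms and reconnect.backoff.max.ms. The sketch below only shows how they could be added to existing client properties; the class name, the method name and the values used in the usage note are assumptions and should be tuned for your workload.

```java
import java.util.Properties;

final class ReconnectBackoffConfig {

    /** Returns a copy of the base properties with reconnect back-off settings added. */
    static Properties withReconnectBackoff(Properties base, long initialMs, long maxMs) {
        if (initialMs <= 0 || maxMs < initialMs) {
            throw new IllegalArgumentException("expected 0 < initialMs <= maxMs");
        }
        Properties props = new Properties();
        props.putAll(base);
        // Wait time before the first reconnect attempt to a given broker.
        props.setProperty("reconnect.backoff.ms", Long.toString(initialMs));
        // Upper bound for the wait time after repeated connection failures.
        props.setProperty("reconnect.backoff.max.ms", Long.toString(maxMs));
        return props;
    }
}
```

For instance, withReconnectBackoff(props, 1000, 10000) makes a client wait one second before its first reconnect attempt, well above the 50 ms Kafka client default, and caps the wait at ten seconds.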

Option 2: open a support case to increase these limits. This may destabilize your cluster at some point, so the change should be made only after extensive testing. Instructions at https://docs.aws.amazon.com/msk/latest/developerguide/limits.html

Option 3: add more brokers to your cluster, keeping in mind that you need to add them symmetrically across all the availability zones you picked when building the cluster. You can have as many as 30 brokers per cluster.

The recommendation for monitoring and alerting is to rely on CloudWatch metrics (https://docs.aws.amazon.com/msk/latest/developerguide/metrics-details.html):

For the dynamic limit (new connections per second), use ConnectionCreationRate, available under [AWS/Kafka] [Broker ID, Cluster Name]

For the static limit (total connections), use ConnectionCount

In both cases, it is recommended to set the alert somewhat below the quota (e.g. at 80%) so that you can spot the issue before it manifests as degraded application health.
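
As a worked example of that rule of thumb, the thresholds can be derived directly from the quotas quoted earlier in this article. The helper below is a hypothetical sketch; the 80% fraction is just the example value from the text, not an AWS recommendation.

```java
final class QuotaAlertThreshold {

    /** Returns the alert threshold for a quota, e.g. fraction 0.8 for an alert at 80%. */
    static double threshold(double quota, double fraction) {
        if (quota <= 0 || fraction <= 0 || fraction > 1) {
            throw new IllegalArgumentException("expected quota > 0 and 0 < fraction <= 1");
        }
        return quota * fraction;
    }
}
```

With the quotas listed in this article, an 80% alert means roughly 2400 for ConnectionCount, 16 per second for ConnectionCreationRate, and about 3.2 per second for ConnectionCreationRate on kafka.t3.small brokers. If your limits have been raised through a support case, recompute the thresholds from the new values.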

Problem: Authentication/Authorization Exception and no authExceptionRetryInterval set

Why it happens: by default, a Spring Kafka listener makes NO further attempts after an authentication exception happens. See

https://docs.spring.io/spring-kafka/api/org/springframework/kafka/listener/ConsumerProperties.html#setAuthExceptionRetryInterval(java.time.Duration) for more details.

This might result in your application not surviving temporary problems with broker connectivity or with retrieving credentials from AWS STS. By setting this value you enable your Spring application to recover from such problems.
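
One way to set the interval is on the container properties of the listener container factory. The following is a minimal sketch, assuming a Spring application with spring-kafka on the classpath and String keys and values; the configuration class name and the 10-second interval are assumptions, so pick a value that fits your recovery expectations.

```java
import java.time.Duration;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
class KafkaListenerRetryConfig {

    @Bean
    ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // Without this, the listener container stops after an authentication
        // or authorization exception instead of trying again.
        factory.getContainerProperties()
                .setAuthExceptionRetryInterval(Duration.ofSeconds(10));
        return factory;
    }
}
```

Listeners created through this factory then retry after the configured interval instead of stopping on the first authentication failure.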

Conclusions

Although it brings many benefits, IAM authentication in MSK requires more care from a consumer tuning perspective than other authentication types. Apart from configuring the Kafka clients, it is strongly recommended to set up alerts on the important metrics of your MSK cluster so that the operations team can react to incoming problems before they manifest.
