MSK upgrade: 2.6.2 to 3.5.1

John Zen
2 min read · Jun 26, 2024


My MSK cluster runs Apache Kafka 2.6.2. End of support for that version is 2024-09-11 (Reference).

Broker Size

Based on AWS's documented MSK best practices, brokers need to be right-sized according to the number of partitions they host.
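A minimal sketch of that sizing check in Python. The per-type limits below are an assumption: they are the values I recall from the AWS best-practices table and may have changed, so verify them against the current documentation. The function names are hypothetical.

```python
# Recommended maximum partitions per broker, replicas included.
# ASSUMPTION: recalled from the AWS MSK best-practices table; verify before relying on it.
RECOMMENDED_MAX_PARTITIONS = {
    "kafka.t3.small": 300,
    "kafka.m5.large": 1000,
    "kafka.m5.xlarge": 1000,
    "kafka.m5.2xlarge": 2000,
    "kafka.m5.4xlarge": 4000,
}


def brokers_over_limit(partition_counts, instance_type, limits=RECOMMENDED_MAX_PARTITIONS):
    """Return {broker_id: count} for brokers above the limit of instance_type."""
    limit = limits[instance_type]
    return {b: c for b, c in partition_counts.items() if c > limit}


def smallest_fitting_type(partition_counts, limits=RECOMMENDED_MAX_PARTITIONS):
    """Return the type with the lowest limit that still fits the busiest broker."""
    busiest = max(partition_counts.values())
    candidates = [t for t, lim in limits.items() if lim >= busiest]
    # None means no listed type fits; reduce partitions or add brokers instead.
    return min(candidates, key=lambda t: limits[t]) if candidates else None


# Hypothetical PartitionCount readings per broker ID.
counts = {1: 1500, 2: 900, 3: 1200}
print(brokers_over_limit(counts, "kafka.m5.large"))
print(smallest_fitting_type(counts))
```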

The number of partitions per broker can be found in CloudWatch:

CloudWatch -> All metrics -> (search MSK -> select AWS/Kafka > Broker ID, Cluster Name, add search PartitionCount).

Tools such as Kafdrop show the partition leader count, which corresponds to LeaderCount in CloudWatch:

CloudWatch -> All metrics -> (search MSK -> select AWS/Kafka > Broker ID, Cluster Name, add search LeaderCount).
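The same numbers can probably be pulled with a script instead of the console. This is a sketch, not a tested production tool: it builds the arguments for CloudWatch's `get_metric_statistics` using the `AWS/Kafka` namespace and the `Cluster Name` / `Broker ID` dimensions shown above. The helper names and the cluster name `my-cluster` are hypothetical, and the `cloudwatch` argument is assumed to be a `boto3.client("cloudwatch")`.

```python
from datetime import datetime, timedelta, timezone


def broker_metric_request(cluster_name, broker_id, metric_name, hours=1, now=None):
    """Build get_metric_statistics arguments for one broker-level MSK metric."""
    now = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kafka",
        "MetricName": metric_name,  # e.g. "PartitionCount" or "LeaderCount"
        "Dimensions": [
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": str(broker_id)},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Maximum"],
    }


def latest_value(cloudwatch, request):
    """Return the most recent datapoint's Maximum, or None if there is no data."""
    points = cloudwatch.get_metric_statistics(**request)["Datapoints"]
    if not points:
        return None
    return max(points, key=lambda p: p["Timestamp"])["Maximum"]


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# cw = boto3.client("cloudwatch")
# print(latest_value(cw, broker_metric_request("my-cluster", 1, "PartitionCount")))
```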

To estimate the cost of an MSK broker:

Backup Investigation

Because so many applications depend on the MSK cluster, a failed upgrade, or client applications that stop working after the upgrade, would have a huge impact.

I investigated creating a replica MSK cluster and keeping it up to date using Replicator.

However, this scheme does not work because Replicator:

  • creates topics in the target cluster named `<sourceKafkaClusterAlias>.topic`; the prefix distinguishes them from other topics.
  • is limited to 750 topics per Replicator.
  • does not replicate write ACLs.
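To make the first two limitations concrete, here is a small sketch of how topic names would look on the target cluster and how many Replicators a topic list would need. The alias `source` and the topic names are made up for illustration; only the naming pattern and the 750 limit come from the notes above.

```python
import math

MAX_TOPICS_PER_REPLICATOR = 750  # limit noted above


def replicated_topic_name(source_alias, topic):
    """Name a source topic gets on the target cluster: <sourceKafkaClusterAlias>.topic"""
    return f"{source_alias}.{topic}"


def replicators_needed(topic_count, limit=MAX_TOPICS_PER_REPLICATOR):
    """Minimum number of Replicators required to cover topic_count topics."""
    return math.ceil(topic_count / limit)


# Hypothetical topics: clients configured for "orders" would not find
# that name on the replica, only "source.orders".
topics = ["orders", "payments"]
print({t: replicated_topic_name("source", t) for t in topics})
print(replicators_needed(1600))
```

Since clients subscribe and publish by topic name, the prefix alone means existing applications could not simply be pointed at the replica cluster.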

Upgrading

I upgraded in two steps:

  1. Change broker size
  2. Upgrade MSK version
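For reference, the two steps could also be started through the API rather than the console. This is a sketch under assumptions: `kafka` is expected to be a `boto3.client("kafka")`, the wrapper function names are hypothetical, and the target instance type is a placeholder. Note that `CurrentVersion` here is the cluster's own revision string, not the Kafka version, and it changes after every operation, so it is re-read before each call.

```python
def current_cluster_version(kafka, cluster_arn):
    """Read the cluster revision string that update calls require."""
    return kafka.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]["CurrentVersion"]


def start_broker_resize(kafka, cluster_arn, instance_type):
    """Step 1: start changing the broker size."""
    return kafka.update_broker_type(
        ClusterArn=cluster_arn,
        CurrentVersion=current_cluster_version(kafka, cluster_arn),
        TargetInstanceType=instance_type,
    )


def start_version_upgrade(kafka, cluster_arn, target_kafka_version):
    """Step 2: start the Kafka version upgrade (only after step 1 has finished)."""
    return kafka.update_cluster_kafka_version(
        ClusterArn=cluster_arn,
        CurrentVersion=current_cluster_version(kafka, cluster_arn),
        TargetKafkaVersion=target_kafka_version,
    )


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# kafka = boto3.client("kafka")
# start_broker_resize(kafka, cluster_arn, "kafka.m5.2xlarge")  # placeholder type
# ... wait for the operation to complete ...
# start_version_upgrade(kafka, cluster_arn, "3.5.1")
```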

Both of these operations take a long time. I monitored them as follows:

At MSK console

  • > Metrics tab
  • > Cluster operations

For the MSK version upgrade, the entry is a link; clicking it shows an upgrade progress bar.
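The Cluster operations list can likely be read from a script as well. The sketch below assumes `kafka` is a `boto3.client("kafka")` and uses its `list_cluster_operations` call; the function name is hypothetical, and pagination (`NextToken`) is ignored for brevity.

```python
def operation_summary(kafka, cluster_arn):
    """Return (operation type, state) pairs, newest operation first."""
    response = kafka.list_cluster_operations(ClusterArn=cluster_arn)
    operations = sorted(
        response["ClusterOperationInfoList"],
        key=lambda op: op["CreationTime"],
        reverse=True,
    )
    return [(op["OperationType"], op["OperationState"]) for op in operations]


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# for op_type, state in operation_summary(boto3.client("kafka"), cluster_arn):
#     print(op_type, state)
```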

At CloudWatch

CloudWatch -> All metrics -> (search MSK -> select AWS/Kafka > Broker ID, Cluster Name, add search ActiveControllerCount).

Only one broker is the active controller at any time, so the average value of ActiveControllerCount across brokers is 1 / n, where n is the number of brokers.

During the upgrade, the brokers are rebooted one at a time in a rolling fashion. Each reboot takes roughly 5–10 minutes. While a broker is rebooting, it drops out of the cluster and the ActiveControllerCount average spikes to 1 / (n - 1), because n is reduced by 1, until the broker rejoins the cluster. Hence, you will see n spikes in ActiveControllerCount during the upgrade, one for each broker reboot.
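As a worked example of that arithmetic, the sketch below computes the expected average and counts spikes in a series of samples. It is an idealised model: the sample values are made up for a 3-broker cluster, and the function names are hypothetical.

```python
from fractions import Fraction


def expected_average(brokers_reporting):
    """Average ActiveControllerCount when exactly one broker is the controller."""
    return Fraction(1, brokers_reporting)


def count_reboot_spikes(samples, n_brokers, tolerance=1e-9):
    """Count runs of consecutive samples that sit above the 1/n baseline."""
    baseline = 1 / n_brokers
    spikes = 0
    in_spike = False
    for value in samples:
        above = value > baseline + tolerance
        if above and not in_spike:
            spikes += 1  # a new spike starts here
        in_spike = above
    return spikes


# Hypothetical samples: baseline 1/3, rising to 1/2 while a broker is down.
samples = [1/3, 1/3, 0.5, 0.5, 1/3, 0.5, 1/3, 1/3, 0.5, 1/3]
print(expected_average(3), expected_average(2))  # 1/3 1/2
print(count_reboot_spikes(samples, n_brokers=3))  # 3
```

If the count of spikes matches the number of brokers, every broker has been through its reboot.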
