From ActiveMQ To Amazon MQ: Why And How We Moved To AWS’s Managed Solution
In any reasonably successful startup that stands the test of time (more than a few years) and grows to a reasonable size (300+ people), there’s always a trail of technical debt left in the wake of the numerous software engineers who have come and gone. As our company has grown and evolved, Bench Engineering has been through a few cycles of who we are, and so we’ve taken on numerous pieces of technical debt that have no clear ownership. In the 18-month-old Platform Team, we’ve spent the past 12 months striving to get a handle on these things. Mostly this has involved Terraforming all the things, but it has also involved building more robust systems that we can kick without them falling over.
ActiveMQ is a service that’s been at the core of Bench infrastructure for longer than most, if not all, of the engineers working here. It had been doing its job of relaying messages between most of our services, and working well enough that we almost forgot about it. Since it never complained or caused us issues, it was easy to forget that it was one of the most critical and fragile pieces of our infrastructure. Luckily, my team has been killing it for the past 9 months, rolling out Kubernetes and smashing through the complicated, undocumented scripts and manually spun-up infrastructure that grows organically within a startup. Our ActiveMQ single point of failure (SPoF), as much as it tried to pretend it would never fail us, didn’t stand a chance.
We originally started our plan to address the vulnerability of Bench’s ActiveMQ many moons ago. I remember the fall (autumn) of 2017 fondly. The sunlight gleaming through the thinning trees outside and across our whiteboard as we sketched solutions to the technical challenges of making our ActiveMQ instance (singular) more robust (plural). The sound of someone streaming AWS re:Invent could be heard in the background. Did we hear the words “ActiveMQ” or “Amazon MQ”? Or both? Yes, AWS had done it again: killed a perfectly good whiteboarding session by releasing a solution to our problems. Amazon MQ would be AWS’s managed version of ActiveMQ, and our project would be shelved a little longer while we waited to see what this solution would look like.
Once Amazon MQ was available and an implementation window opened up again, we queued up the work to move over to Amazon MQ.
Our challenges with moving to Amazon MQ mostly stemmed from our legacy trust of our internal network. Our services used neither SSL nor username and password to access ActiveMQ. Amazon MQ made both mandatory — which ain’t a bad thing.
We needed to move 13 of our services (mostly in Scala/Java, some in Python) from our own ActiveMQ to Amazon MQ without any downtime or losing any messages, while turning on SSL, credentials, and the failover configuration in the process. Also, not all of these Scala/Java services used the same libraries to connect to ActiveMQ. Some, particularly our monolithic “microservice” (it’s a journey), connected to ActiveMQ in multiple ways.
We also didn’t want to mutate our existing ActiveMQ instance if we could avoid it, as doing so would risk breaking our working system.
Lastly, there was no clear consumer-producer boundary between our services, so splitting the migration between the message producers and message consumers wasn’t really an option.
Please note: We’re not ActiveMQ experts and had no intention of becoming so for this project, so apologies to any ActiveMQ administrators that are cringing at any of our approaches.
Who doesn’t love DNS? When migrating services, it’s generally the best tool in our bag: set up a CNAME with a low TTL pointing at the old instance, then flip it to the new one. Everything moves over without having to quickly reconfigure twenty services at the same time. This was our plan for switching all our services to Amazon MQ. But first we would use AWS Security Groups to block any traffic to the old instance.
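Sketched with Route 53 in mind (the record and broker names below are placeholders, not our real ones), the flip amounts to a single UPSERT of the CNAME, which can be handed to boto3’s `route53.change_resource_record_sets`:

```python
def cname_flip(record_name, target, ttl=60):
    """Build a Route 53 change batch that repoints a CNAME at a new target.

    All names used here are hypothetical examples.
    """
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,  # keep the TTL low so the flip propagates quickly
                "ResourceRecords": [{"Value": target}],
            },
        }]
    }

# Repoint the broker alias from the legacy instance to Amazon MQ:
batch = cname_flip(
    "mq.internal.example.com.",
    "b-1234abcd-1.mq.us-west-2.amazonaws.com.",
)
```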
We burnt a little too much energy trying to leave our legacy ActiveMQ untouched and insert some kind of shim in front of it to proxy optional SSL and credentials. This would have allowed us to migrate one service at a time. We looked at using ActiveMQ’s ProxyConnector, trying to decipher what the limited documentation was telling us, interpreting it six different ways and failing to get any of them to work for our needs. That’s because we should have been looking at NetworkConnector! Much better documentation. All we had to decide was whether we needed “Connectors in each direction” or a “Duplex connector”. No, wait, what were we doing? Moment of clarity…
We realized it was much simpler to replace our existing ActiveMQ with a new instance that supported both non-SSL and SSL at the same time. Amazon MQ used different ports (61615, 61617) for its SSL interfaces from our ActiveMQ’s non-SSL ports (61614, 61616), so we added the SSL interfaces to our existing setup. We replaced our legacy ActiveMQ instance with a new instance of ActiveMQ configured with these four interfaces. These all went to the same queues, enabling our services to connect to either the non-SSL or the SSL endpoints. That way, we could migrate services over to the SSL endpoints one at a time. Once all the services were using the SSL endpoints and configured to pass credentials, we would make the grand switch over to Amazon MQ, where these things were mandatory.
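In activemq.xml terms, the four interfaces look roughly like this — a sketch, assuming STOMP on 61614/61615 and OpenWire on 61616/61617, with an sslContext pointing at the broker’s keystore (the keystore path and password are illustrative):

```xml
<broker xmlns="http://activemq.apache.org/schema/core">
  <!-- Keystore backing the SSL connectors (path is illustrative) -->
  <sslContext>
    <sslContext keyStore="file:/opt/activemq/conf/broker.ks"
                keyStorePassword="changeit"/>
  </sslContext>

  <transportConnectors>
    <!-- Legacy plaintext interfaces -->
    <transportConnector name="stomp"        uri="stomp://0.0.0.0:61614"/>
    <transportConnector name="openwire"     uri="tcp://0.0.0.0:61616"/>
    <!-- New SSL interfaces, matching Amazon MQ's ports -->
    <transportConnector name="stomp+ssl"    uri="stomp+ssl://0.0.0.0:61615"/>
    <transportConnector name="openwire+ssl" uri="ssl://0.0.0.0:61617"/>
  </transportConnectors>
</broker>
```

All four connectors serve the same broker, and therefore the same queues, which is what let services move to SSL endpoints independently.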
Implementing the passing of credentials was easy, since you can pass them to ActiveMQ and it will simply ignore them if it isn’t configured to use them, as was the case with our legacy ActiveMQ. We could therefore implement username and password one service at a time, and all of them would initially be ignored. When we pointed the DNS at Amazon MQ and our services reconnected, those credentials would start being used.
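For reference, on a self-managed broker, “configured to use them” means an authentication plugin in activemq.xml; without a block like this (our legacy setup), supplied credentials are silently ignored. The username and password here are placeholders:

```xml
<broker xmlns="http://activemq.apache.org/schema/core">
  <plugins>
    <!-- Once this plugin is present, every connection must authenticate -->
    <simpleAuthenticationPlugin>
      <users>
        <authenticationUser username="svc-user"
                            password="not-a-real-password"
                            groups="users"/>
      </users>
    </simpleAuthenticationPlugin>
  </plugins>
</broker>
```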
A choice we had made previously was not to have a failover instance of ActiveMQ. Partly this was because we were planning to move to Amazon MQ and partly because we thought, in the short-term, it would be easier to fix the existing instance than deal with messages split across two independent ActiveMQ instances.
Amazon MQ does have failover support if you choose the Active/Standby Broker for High Availability rather than the Single-Instance Broker. There’s no option to migrate from Single-Instance to Active/Standby, so we had to start with Active/Standby from the beginning. A general rule we’ve found with AWS is that if you opt in to failover support (aka Multi-AZ in RDS land), you increase the chances that AWS will perform maintenance on the underlying infrastructure, triggering a failover. On the other hand, without a standby instance, any infrastructure issue or maintenance will cause downtime.
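On the client side, failover support is just a transport URL listing both brokers. A small helper makes the shape clear (the host names are placeholders, not our real Amazon MQ endpoints):

```python
def failover_url(hosts, port=61617, scheme="ssl", params="randomize=false"):
    """Build an ActiveMQ failover transport URL covering both brokers.

    randomize=false keeps clients trying the brokers in the listed order.
    """
    endpoints = ",".join(f"{scheme}://{h}:{port}" for h in hosts)
    url = f"failover:({endpoints})"
    return f"{url}?{params}" if params else url

url = failover_url([
    "b-1234abcd-1.mq.us-west-2.amazonaws.com",
    "b-1234abcd-2.mq.us-west-2.amazonaws.com",
])
```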
Nevertheless, to keep the move from our single instance of ActiveMQ to Amazon MQ simple, we decided to migrate services to point only at the active instance on Amazon MQ. Then, shortly afterwards, we updated the configuration of each of our services to support failover to the standby. This worked, but it wasn’t obvious which instance on Amazon MQ was currently active. To determine this, we would exec into a running container with network access and use telnet to test access to the two endpoints.
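The telnet ritual can be scripted. A minimal sketch of the same TCP check — only the active broker accepts connections, so whichever endpoint connects is the active one (host names are placeholders):

```python
import socket

def is_accepting(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Whichever broker endpoint accepts the connection is the active one:
for host in ("b-1234abcd-1.mq.us-west-2.amazonaws.com",
             "b-1234abcd-2.mq.us-west-2.amazonaws.com"):
    print(host, "active" if is_accepting(host, 61617) else "standby")
```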
We understood that there was some risk here that AWS would fail over to the standby before we had configured our services to support it.
Because our migration involved only a DNS record change and enabling/disabling a few security groups, rollback was trivial, as long as we didn’t end up leaving messages on Amazon MQ and rolling back without consuming them.
Not zero downtime
Given that this was one of the largest infrastructure changes we had taken on, and given the widespread reliance on ActiveMQ, we proactively decided to take our services down during this migration. This allowed us to stop all consumers and producers, switch over the DNS, and cut over to Amazon MQ gracefully.
We chose a time in the evening when activity was low. Part of our process was to check the queues. If messages hadn’t been consumed after services had been shut down, we’d start up specific consumers until all the queues were empty, then shut them back down again before shutting down ActiveMQ. This was needed for several services. As mentioned previously, there wasn’t a clear separation between producers and consumers that would have let us shut things down in order, producers first, then consumers. This alternative cleanup process was good enough to get the job done in a timely manner.
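Checking queue depth can be done over the ActiveMQ web console’s Jolokia endpoint. A sketch of building the read URL for a queue’s QueueSize attribute — the broker name, host, and console port here are assumptions about a stock ActiveMQ 5.x console, not our actual setup:

```python
def queue_size_url(host, queue, broker="localhost", port=8161):
    """Build the Jolokia read URL for a queue's QueueSize attribute."""
    mbean = (f"org.apache.activemq:type=Broker,brokerName={broker},"
             f"destinationType=Queue,destinationName={queue}")
    return f"http://{host}:{port}/api/jolokia/read/{mbean}/QueueSize"
```

Fetching that URL (with the console’s credentials) returns a JSON body whose `value` field is the number of messages still on the queue.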
In total, Bench services were down for 20 minutes before coming back up running on Amazon MQ. Once we were happy that things were running smoothly, we followed up by rolling out configuration to each service to support failover.
We tested failover on our Staging environment thoroughly before trying it out on Production. This involved a kubectl exec into one Kubernetes pod of each service and watching TCP connections to both the active and standby ActiveMQ instances (now both on Amazon MQ). Watching the failover happen, we found that one service failed to fail over correctly because of a misconfiguration. This was trivial to fix, but it made sure we doubled down on reviewing the Production configuration one more time before testing there.
The Production failover worked well. Amazon MQ only provides the ability to “reboot”, which fails over to the standby and then back again, so testing means a minimum of two failovers in a row. The failover worked so well in Production that we did it again straight after, once again watching the TCP connections swiftly move over to the standby and then, shortly afterwards, back again. It was a beautiful sight.
Checking our service logs in Splunk, we found no errors, or anything unusual.
We will still use Datadog to monitor ActiveMQ, but it’s much simpler to get all the metrics via CloudWatch than having to configure JMX to filter the 4,000+ ActiveMQ metrics down to Datadog’s allowable 350 per host.
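Pulling broker metrics out of CloudWatch is one GetMetricStatistics call per metric. A hedged sketch of the request parameters for boto3 — the `AWS/AmazonMQ` namespace and `Broker` dimension are CloudWatch’s, while the broker name is a placeholder:

```python
from datetime import datetime, timedelta

def broker_metric_params(broker_name, metric, minutes=60, period=300):
    """Parameters for cloudwatch.get_metric_statistics (boto3)."""
    now = datetime.utcnow()
    return {
        "Namespace": "AWS/AmazonMQ",
        "MetricName": metric,  # e.g. "CpuUtilization", "TotalMessageCount"
        "Dimensions": [{"Name": "Broker", "Value": broker_name}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": period,
        "Statistics": ["Average"],
    }

params = broker_metric_params("my-broker-1", "CpuUtilization")
```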
Amazon MQ itself is still young, so there were some obvious short-term improvements that could be made. We passed these on to the team at AWS…
- Ability to see the current active endpoint in the UI
- Ability to fail over permanently to the standby, as RDS allows, since a reboot fails over twice
- Ability to add a new Security Group without rebuilding the instance
- Ability to change the maintenance window without rebooting
- Ability to change subnets without rebuilding instance
- Medium size broker instance types
A lot of things could have been simpler with our migration. If we had already been using SSL and credentials, there would have been a lot less work to do. If we had had a proper delineation of message producers and consumers across our services, we might have been able to roll over more smoothly. But twenty minutes of downtime and zero messages lost in the process is a fair price for adding a significant amount of reliability to such a large and ingrained piece of our infrastructure.