How we moved 100 million chat messages with no downtime — Part 2

Andrei Isac
Beekeeper Technology Blog
7 min read · Nov 30, 2022

How we migrated chat data between two live systems with no downtime, and without production incidents or all-nighters, using Kafka and Debezium. Stats and outcomes for the tech savvy.

What is this about

This is the second part of our real-time chat migration article. Last time we explained why our business use case required a real-time migration between two live chat systems. We shipped all of the historical data and set up a pipeline that also moves real-time incoming data from our legacy Chats V1 system to Apache Kafka using Debezium¹ CDC, without touching any code. We will now present the second step: writing the data to our new Chats V2 system. Make sure to check out Part 1 before reading on in order to have all the context.

Where we left off

Figure 1 — CDC Pipeline Architecture

By now our CDC pipeline has exported all the historical Chats V1 data to a Kafka topic and is capturing any real-time changes while Chats V1 is still running. We are ready to start writing to Chats V2.

Part 2: Shipping the data to Chats V2

As explained in Part 1, Chats V1 and Chats V2 are powered by very different technologies: the former is a REST + long-polling system, while the latter is powered by MongooseIM² and therefore uses XMPP over TCP or WebSocket.

Getting data from Kafka and converting it to a Chats V2-compatible format is handled by the bridge microservice (the Chats Migration Bridge). It consumes data from the Kafka migration topic and translates the events into the required UPSERT operations for both databases:

  1. The MongooseIM DB where it will write messages and inbox information
  2. The Beekeeper Chats Service where it will write group chat and membership information
Figure 2 — Full ChatsV1 to V2 Diagram

The Chats Migration Bridge's logic is fairly low-level, as it writes directly to the Chats V2 databases, bypassing the business layer of the two microservices. This made the implementation significantly faster to build and gives better performance for the migration itself. Because of this, the service also needs to connect to three databases (the MySQL read replica and the two Chats V2 Postgres instances).
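
To make this concrete, below is a minimal sketch of how such a bridge could route a decoded Debezium change event to the right destination. The Chats V1 table names and the returned identifiers are illustrative assumptions, not Beekeeper's actual schema:

# Illustrative sketch only: table and field names are assumptions,
# not the actual Beekeeper schema.
def route_event(event: dict):
    """Decide where a Debezium change event belongs in Chats V2."""
    row = event["after"]               # Debezium puts the new row state under "after"
    table = event["source"]["table"]   # the Chats V1 table the change came from
    if table == "chat_groups":
        # Group chat metadata and memberships go to the group chats database
        return ("group_chats_db", "group_chats", row)
    if table == "chat_messages":
        # Messages and inbox/read state go to the MongooseIM database
        return ("mongooseim_db", "mam_muc_message", row)
    return None  # tables that are not part of the migration are ignored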

There will be bugs. How do we ensure a smooth rerun?

MongooseIM uses XMPP for chatting and thus implements a tweaked version of the XMPP XML stanzas³. An example is presented below:

<message type="chat" from="eac0b9b0-4e7d-4c0c-8e8f-4369455591d4@isac.andrei.bkpr.link"
         to="6087f277-e4a3-4979-b4d8-4dada3a408f1@isac.andrei.bkpr.link" xmlns="jabber:client">
  <body>Let&apos;s try the new chats</body>
  <message-hints hide-receipts="true" xmlns="bkpr:xmpp:message-hints:0"/>
</message>

While bypassing the business layer of Chats V2 has advantages, doing such low-level mapping for all of the cases needed for a complete chat migration is difficult and bug-prone. Hence, being able to rerun the migration when bugs are discovered, without manual cleanup or high-risk delete operations, was critical in our case.

Accordingly, the service uses UPSERTs⁴ exclusively, which guarantees that the entire migration process can be re-run if bugs are discovered (there were quite a few reruns in our test data centers; more on that in the Execution section). An example is presented below:

INSERT INTO mam_muc_message (id, room_id, sender_id, nick_name, message, search_body, origin_id)
VALUES (:id, :room_id, :sender_id, :nick_name, :message, :search_body, :origin_id)
ON CONFLICT ON CONSTRAINT mam_muc_message_pkey
DO UPDATE SET sender_id   = EXCLUDED.sender_id,
              nick_name   = EXCLUDED.nick_name,
              message     = EXCLUDED.message,
              search_body = EXCLUDED.search_body,
              origin_id   = EXCLUDED.origin_id
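
The :name placeholders map naturally onto named bind parameters. As a hedged illustration (the post does not show the bridge's actual implementation language or database layer), the statement could be executed idempotently like this with SQLAlchemy; the connection string is a placeholder:

from sqlalchemy import create_engine, text

# Placeholder connection string for the MongooseIM Postgres instance
engine = create_engine("postgresql+psycopg2://bridge:secret@chats-database/mongooseim")

UPSERT = text("""
    INSERT INTO mam_muc_message (id, room_id, sender_id, nick_name, message, search_body, origin_id)
    VALUES (:id, :room_id, :sender_id, :nick_name, :message, :search_body, :origin_id)
    ON CONFLICT ON CONSTRAINT mam_muc_message_pkey
    DO UPDATE SET sender_id = EXCLUDED.sender_id,
                  nick_name = EXCLUDED.nick_name,
                  message = EXCLUDED.message,
                  search_body = EXCLUDED.search_body,
                  origin_id = EXCLUDED.origin_id
""")

def write_message(row: dict) -> None:
    # Processing the same Kafka record twice is harmless: the second run
    # simply overwrites the row with identical values.
    with engine.begin() as conn:
        conn.execute(UPSERT, row)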

Actually running the full pipeline. Binlogs and IOPS

Deployment and Execution Steps

As a reminder from Part 1, our product runs in multiple data-centers across various regions, powered by AWS and GCP. We have a microservices-oriented stack that runs inside Kubernetes. The infrastructure is managed via Terraform.

For AWS, continuing the deployment process from Part 1 was quite straightforward:

  1. Deploy the Kafka topic and configuration ☑️ (Part 1)
  2. Deploy the MySQL Read Replica ☑️ (Part 1)
  3. Deploy the migration Kafka Connector and start the snapshot of the historical data. ☑️ (Part 1)
  4. Deploy the chats migration bridge micro-service and monitor the migration ⏭
  5. Run a correctness check to validate the migrated data (see the sketch after this list) ⏭
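
For step 5, the correctness check can be as simple as comparing row counts (plus spot-checking individual rows) between the source read replica and the Chats V2 databases. A rough sketch of the idea; connection strings and table names are placeholders:

from sqlalchemy import create_engine, text

# Placeholder connection strings: the real check runs against the MySQL
# read replica and the Chats V2 Postgres instances.
source = create_engine("mysql+pymysql://reader:secret@chats-v1-replica/chats")
destination = create_engine("postgresql+psycopg2://bridge:secret@chats-database/mongooseim")

def count_rows(engine, sql: str) -> int:
    with engine.connect() as conn:
        return conn.execute(text(sql)).scalar_one()

v1_messages = count_rows(source, "SELECT COUNT(*) FROM chat_messages")  # placeholder table name
v2_messages = count_rows(destination, "SELECT COUNT(*) FROM mam_muc_message")

# Real-time traffic keeps flowing during the check, so a small delta is tolerated.
assert abs(v1_messages - v2_messages) < 100, (v1_messages, v2_messages)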

Migrating & Monitoring

Once the snapshot was completed and the data was in the migration Kafka topic, the actual migration could start. We went with 25 partitions for the topic for a reasonable amount of parallelism and scaled up the migration bridge microservice to 25 pods, one pod consuming one partition.
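
Each bridge pod is simply another consumer in the same consumer group, so with 25 partitions and 25 pods Kafka's group rebalancing assigns exactly one partition to each pod. A minimal sketch of what each pod runs, using kafka-python; the topic and group names are illustrative:

import json
from kafka import KafkaConsumer

# Topic and group id are illustrative. With 25 partitions and 25 pods in the
# same consumer group, each pod ends up owning exactly one partition.
consumer = KafkaConsumer(
    "chats-v1-migration",
    bootstrap_servers=["kafka:9092"],
    group_id="chats-migration-bridge",
    enable_auto_commit=False,  # commit offsets only after a successful write
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for record in consumer:
    event = record.value
    # ... translate the CDC event and UPSERT it into the Chats V2 databases ...
    consumer.commit()  # at-least-once delivery: reprocessing is safe thanks to the upserts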

Monitoring is done primarily via Grafana (both visual dashboards and alerts). The metrics that we monitored include:

  • Migration Bridge throughput
  • 40+ error and warning metrics
  • Chats V2 API response time
  • MongooseIM request time
  • Chats V2 databases disk usage
  • Chats V2 databases CPU and memory
  • Chats V2 databases IOPS burst credits
  • Chats V2 databases CPU credits
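
Most of these metrics come from the databases and MongooseIM themselves, but the bridge's throughput and error counters are ordinary application metrics. As a sketch of how such counters could be exposed for Grafana via Prometheus (the metric names are invented for illustration):

from prometheus_client import Counter, Histogram, start_http_server

# Metric names are illustrative, not the actual Beekeeper dashboards.
MIGRATED_EVENTS = Counter("bridge_migrated_events_total", "Events written to Chats V2", ["kind"])
MIGRATION_ERRORS = Counter("bridge_errors_total", "Errors during the migration", ["reason"])
UPSERT_LATENCY = Histogram("bridge_upsert_seconds", "Time spent upserting one event")

start_http_server(9090)  # scrape endpoint for Prometheus, visualised in Grafana

def process(event: dict) -> None:
    try:
        with UPSERT_LATENCY.time():
            ...  # translate and upsert the event
        MIGRATED_EVENTS.labels(kind="message").inc()
    except Exception as exc:
        MIGRATION_ERRORS.labels(reason=type(exc).__name__).inc()
        raise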

Below we present the results from one of our AWS production data-centers. Infrastructure and data set details are presented in Table 1.

Execution on AWS

Table 1 — Infrastructure and data set details

The number of customers already using Chats V2 at the time of the migration was fairly low, so the system was running on top of small database instances and nodes.

The migration process itself has two important parts: inserting group chats and memberships into the group chats database, and inserting messages and read events into MongooseIM’s messages database. The process went smoothly in production, with infrastructure resources within reasonable thresholds (we allowed ourselves to get quite close to 100% CPU before scaling down the migration pods). See Plot 1 and Plot 2.

Legend:

  • “chats-database” is our MongooseIM database
  • “microservices” is our internal group chats database, storing metadata such as group chat titles, member roles, and more
Plot 1 — Migration Consumer Lag
Plot 2 — Destination Databases CPU Usage (%)

There was one problem that we faced on AWS, however: IOPS burst credit exhaustion on the messages database. For our typical usage patterns, neither permanently increasing the database disk size (Amazon ties IOPS performance to disk size⁵) nor opting for provisioned IOPS made sense financially, so we decided not to scale up the hardware for the migration and instead ran the last part at a lower speed. Due to the high amount of data transferred during the migration, our throughput dropped by an order of magnitude once the burst credits were exhausted. Plot 3 illustrates the IOPS credit exhaustion during the migration.

Plot 3 — Used IOPS Burst Credits for destination databases (%)
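
The credit level itself is exposed by CloudWatch as the RDS BurstBalance metric (the same data behind Plot 3). A small sketch of reading it with boto3, for example to decide when to slow the migration down; the region and instance identifier are placeholders:

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region

def burst_balance(db_instance: str) -> float:
    """Return the most recent gp2 BurstBalance (%) for an RDS instance."""
    now = datetime.datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance}],
        StartTime=now - datetime.timedelta(minutes=10),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else 100.0

if burst_balance("chats-database") < 10.0:  # instance name is a placeholder
    print("IOPS burst credits nearly exhausted, consider slowing the migration down")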

Execution on GCP

Google Cloud SQL read replicas cannot be used for CDC because, unlike on AWS, binlogs are not available on read replicas⁶. Accordingly, we decided to use a scheduled maintenance window for our GCP data center. Downtime was required only for the snapshot phase (see Part 1), which took just 20 minutes for this smaller data set, so this was not a major inconvenience. Once the snapshot was done, normal service resumed on Chats V1 for the real-time part of the migration.

However, the migration was much faster on GCP because it did not suffer from IOPS limitations: the different pricing model in our case gave us around 10x the IOPS performance we had on AWS. The results for our smaller GCP DC are presented below:

Table 2 — GCP Migration Performance

Conclusion and Learnings

There were some interesting learnings along the way:

  1. There was a lot of implicit knowledge and technical complexity in this project, which led to some bugs being found only in later stages and a couple of reruns being required.
  2. Reruns of the migration are significantly slower due to the very nature of UPSERT operations. Upserting also consumes extra disk space because we cannot easily reclaim space from dead tuples⁷.
  3. The exhaustive list of metrics we collected was crucial in validating correctness and identifying bugs.

All in all, the migration was quite a smooth journey. The defensive programming approach we took, and all the metrics that spotted inaccuracies and errors, paid off. There were no production incidents, very few customer complaints, and no nights spent with our eyes glued to the monitoring dashboards.

Stefan Irimescu, Andrei Isac — Beekeeper AG

Missed the first part of the story? You can read it here.

References

  1. Debezium
  2. MongooseIM
  3. XMPP XML Stanzas
  4. Upserts in Postgres
  5. AWS IOPS
  6. GCP Read replica binlog
  7. Postgres VACUUM
