MySQL: From Fivetran to Debezium

Albert Franzi
7 min read · Sep 7, 2024

This post outlines how we transitioned from Fivetran's MySQL connectors to a self-hosted Debezium solution, which led to a $250,000 reduction in our annual expenses.

Context

At Lokalise.com, we manage multiple MySQL instances responsible for handling all translation processes (both l10n and i18n). As the volume of text data transferred between our MySQL databases and our Snowflake instance grew, we relied on Fivetran’s MySQL connector for Change Data Capture (CDC). However, Fivetran’s pricing model, based on Monthly Active Rows (MAR), quickly became unsustainable as our customer base expanded. The tipping point came when we lost a significant discount at our yearly renewal and our bill increased sixfold 💸.

In search of a cost-effective alternative, we decided to self-host Debezium on our EKS (Kubernetes) cluster. This allowed us to directly capture all CDC events from MySQL binlogs, bypassing the rising costs associated with Fivetran.

Note: Our current costs are around $700/month.

The infrastructure

Pipeline overview E2E

Before diving into the details of deploying and connecting Debezium, it’s important to note that we already have an EKS (Kubernetes) cluster in place, managed by ArgoCD for Helm-based deployments.

This article won’t cover the setup of these components but will instead focus on:

  • Deploying Apache Kafka (MSK) with SCRAM authentication and Access Control Lists (ACLs).
  • Deploying Debezium connectors using Strimzi on Kubernetes.
  • Storing data in S3 and forwarding it to Iceberg (Glue) and Snowflake.

1. Deploying a Kafka Cluster (MSK)

Over the past few years, AWS has significantly improved MSK (Managed Streaming for Apache Kafka), making it a reliable and relatively affordable option for running Kafka in the cloud. Given its feature set and maturity, we decided to adopt it for our CDC pipeline.

Deploying MSK Using Terraform

To simplify the deployment, we leveraged the community-maintained Terraform module (terraform-aws-modules: msk-kafka-cluster), allowing us to spin up an MSK cluster quickly and consistently (a minimal example follows the note below).

Note: We highly recommend configuring the MSK cluster to not be exposed to the public internet for security reasons.
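As an illustration, a minimal instantiation of the module might look like the sketch below. The cluster name, subnets, and security group are placeholders, and the input names follow the module's documented interface, which can vary between module versions.

```hcl
module "msk_cluster" {
  source = "terraform-aws-modules/msk-kafka-cluster/aws"

  name                   = "lok-cdc" # hypothetical cluster name
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  # Private subnets only: the cluster is never exposed to the public internet.
  broker_node_client_subnets  = var.private_subnet_ids
  broker_node_instance_type   = "kafka.m5.large"
  broker_node_security_groups = [var.msk_security_group_id]

  # Enable both SASL mechanisms: IAM for the Kafka Terraform Provider,
  # SCRAM for the per-user ACLs described below.
  client_authentication = {
    sasl = { iam = true, scram = true }
  }
}
```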

Authentication and Authorization

For client authentication, we enabled both IAM and SCRAM mechanisms. IAM authentication is used by the Kafka Terraform Provider, while SCRAM is used to enforce per-user ACLs.

The Kafka Terraform Provider is configured with the necessary permissions: we granted MSK access to our Terraform IAM role, enabling it to authenticate and create ACL resources.

Configuring Kafka ACLs

We set up two distinct ACL modules (see: Kafka ACL docs):

  • ACL Consumers: Read-only access, defined with a topic prefix (e.g., lok.mysql).
  • ACL Producers: Write access, also defined with a topic prefix (e.g., lok.mysql).
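For illustration, assuming the community Kafka Terraform provider (Mongey/kafka), a prefix-based read-only ACL for a consumer user might look like the following sketch; the principal name is hypothetical.

```hcl
# Grant read access to every topic whose name starts with "lok.mysql".
resource "kafka_acl" "consumer_read" {
  resource_name                = "lok.mysql"
  resource_type                = "Topic"
  resource_pattern_type_filter = "Prefixed"

  acl_principal       = "User:lok-consumer" # hypothetical SCRAM username
  acl_host            = "*"
  acl_operation       = "Read"
  acl_permission_type = "Allow"
}
```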

In addition to defining these ACLs, each Kafka user’s credentials must be stored in AWS Secrets Manager. This allows us to securely manage them and associate them with the MSK cluster during instantiation (see: aws docs).

The following code snippet demonstrates how to register a new Kafka user by generating a random secret in AWS Secrets Manager, which can be associated with the MSK instantiation. Refer to the AWS documentation for details on MSK limitations.
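A sketch of that snippet follows; resource names are illustrative, and the MSK cluster ARN would come from your own module or resource. Note that MSK requires SCRAM secrets to be named with the AmazonMSK_ prefix and encrypted with a customer-managed KMS key.

```hcl
resource "random_password" "kafka_user" {
  length  = 32
  special = false
}

# The secret name must start with "AmazonMSK_", and it must be encrypted
# with a customer-managed KMS key (the default AWS-managed key is not supported).
resource "aws_secretsmanager_secret" "kafka_user" {
  name       = "AmazonMSK_lok-consumer"
  kms_key_id = aws_kms_key.msk.key_id
}

resource "aws_secretsmanager_secret_version" "kafka_user" {
  secret_id = aws_secretsmanager_secret.kafka_user.id
  secret_string = jsonencode({
    username = "lok-consumer"
    password = random_password.kafka_user.result
  })
}

# Associate the SCRAM secret with the MSK cluster.
resource "aws_msk_scram_secret_association" "this" {
  cluster_arn     = module.msk_cluster.arn # assumes the module exposes the ARN
  secret_arn_list = [aws_secretsmanager_secret.kafka_user.arn]
}
```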

Note: These modules are based on topic prefixes for simplicity, but they can be adapted according to your specific business requirements.

2. Deploying our Strimzi Debezium connectors

Strimzi is an excellent project that enables us to run Apache Kafka clusters on Kubernetes. To deploy Debezium connectors, we first need to deploy the strimzi-kafka-operator, which then lets us manage the connectors with simple Helm charts.

Key Strimzi Components

To enable Debezium on Strimzi, we needed to deploy the following components:

  • KafkaConnect: Manages the configuration of the Kafka connectors (see: schema)
  • KafkaConnector: Defines and manages the individual connectors themselves (see: schema)

One major advantage of KafkaConnect is the ability to specify the set of plugins it supports. In our case, we needed to include the Debezium MySQL plugin. Upon deploying KafkaConnect, Strimzi automatically initiates a build pod, which downloads the necessary plugins, builds a new image, and publishes it to ECR. The running pods then start with these plugins pre-installed (a minimal manifest is sketched after the note below).

Note: The pod ServiceAccount will need to be associated with IAM permissions so it can push the image to the registry.
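As a sketch, a minimal KafkaConnect manifest with a Debezium MySQL build might look like the following; the broker address, ECR repository, and secret names are placeholders, and the plugin URL pins an arbitrary Debezium release.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: debezium-connect
  annotations:
    # Allow KafkaConnector custom resources to manage connectors on this cluster.
    strimzi.io/use-connector-resources: "true"
spec:
  replicas: 1
  bootstrapServers: <msk-bootstrap-brokers>:9096 # MSK SASL/SCRAM listener
  tls:
    trustedCertificates: [] # use the default trust store for MSK's public certs
  authentication:
    type: scram-sha-512
    username: lok-producer
    passwordSecret:
      secretName: kafka-producer-credentials # hypothetical Kubernetes Secret
      password: password
  build:
    output:
      type: docker
      image: <account>.dkr.ecr.<region>.amazonaws.com/debezium-connect:latest
    plugins:
      - name: debezium-mysql
        artifacts:
          - type: tgz
            url: https://repo1.maven.org/maven2/io/debezium/debezium-connector-mysql/2.7.0.Final/debezium-connector-mysql-2.7.0.Final-plugin.tar.gz
```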

Deploying Debezium Connectors

Once the KafkaConnect service is up and running, you can deploy your Debezium connectors via Helm charts. The following is an outline of our typical deployment flow:

  1. Define your KafkaConnector configuration (e.g., specifying source databases, topics, and data serialization formats).
  2. Apply the configuration to the cluster using the Strimzi KafkaConnector resource.
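For illustration, a Debezium MySQL KafkaConnector could be defined as in the sketch below; hostnames, credentials, and the table list are placeholders, and the property names follow Debezium 2.x.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: mysql-cdc
  labels:
    strimzi.io/cluster: debezium-connect # must match the KafkaConnect name
spec:
  class: io.debezium.connector.mysql.MySqlConnector
  tasksMax: 1 # the Debezium MySQL connector always runs a single task
  config:
    database.hostname: <mysql-host>
    database.port: 3306
    database.user: debezium
    # Hypothetical: inject the password through a Kafka Connect config provider
    # rather than hard-coding it in the manifest.
    database.password: "${secrets:kafka/mysql-credentials:password}"
    database.server.id: 184054
    topic.prefix: lok.mysql
    table.include.list: lokalise.translations # hypothetical table
    schema.history.internal.kafka.bootstrap.servers: <msk-bootstrap-brokers>:9096
    schema.history.internal.kafka.topic: lok.mysql.schema-history
```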

Note: To visualize cluster resources and inspect your topics, I highly recommend deploying the Kafka-UI tool by Provectus (see: helm chart).

3. Monitor Debezium

When working with large volumes of critical data, monitoring becomes essential to ensure the system is running smoothly and to receive alerts when issues arise.

KafkaConnect includes a feature that allows exporting metrics to Prometheus via the jmx_prometheus_exporter. This makes it easy to collect and visualize important metrics about your KafkaConnect and Debezium setup.

The Strimzi GitHub project provides several useful examples that you can follow to implement this.

One of the most important metrics to track is the Time Behind Source, which can be visualized using Grafana. This metric shows the delay between when data is captured and when it is processed, which is critical for ensuring low-latency data streaming.

Grafana graph: Time Behind Source

To enable Prometheus metrics collection, you’ll need to annotate the KafkaConnect pod with Prometheus annotations so that the Prometheus server can discover and scrape the metrics.

Additionally, you will need to configure the JMX exporter by defining the jmx-exporter-config, which ensures that Kafka and Debezium metrics are exposed to Prometheus in a clean and structured format.
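Both pieces can be wired up on the KafkaConnect resource itself. The fragment below is a sketch; the ConfigMap name is hypothetical, and 9404 is Strimzi's default metrics port.

```yaml
# Fragment of the KafkaConnect spec from the earlier sketch.
spec:
  template:
    pod:
      metadata:
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/port: "9404"
          prometheus.io/path: "/metrics"
  metricsConfig:
    type: jmxPrometheusExporter
    valueFrom:
      configMapKeyRef:
        name: jmx-exporter-config # hypothetical ConfigMap holding the rules
        key: metrics-config.yml
```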

4. Spark Streaming to S3

In the next article, I will cover Spark Streaming and the integration with Iceberg and Snowflake. By splitting the topics across articles, I aim to keep each post focused on a specific area, making it easier to digest without becoming too lengthy 🤓.

Given the importance of security and performance in this area, I’ve included some key notes and recommendations to enhance these aspects in your deployment.

🔒Security notes

Securing the MSK Cluster

When deploying MSK, ensuring that your cluster is secure is paramount. Here are several best practices to consider:

  1. Network Isolation:
    It is crucial to ensure that your MSK cluster is not exposed to the public internet. Always deploy MSK within a private subnet and use VPC endpoints to control access to the cluster. This helps limit exposure and reduces the risk of unauthorized access.
  2. Encryption:
    Enable encryption at rest and in transit for all data handled by MSK. AWS MSK natively supports encryption using AWS KMS, so you can ensure that all sensitive data is securely encrypted before being written to storage. Also, ensure that TLS (Transport Layer Security) is enabled for encryption in transit between Kafka brokers, producers, and consumers.
  3. IAM Access Control:
    AWS MSK integrates with IAM to control who can create, modify, or delete resources in your Kafka cluster. By enforcing least-privilege IAM roles, you can restrict access to MSK and its resources, ensuring that only authorized users and services can interact with Kafka topics.
  4. Monitoring and Alerts:
    Set up AWS CloudWatch metrics and Kafka-specific alerts to monitor the health of your MSK cluster in real-time. Configure alarms for key metrics such as disk usage, consumer lag, and broker CPU utilization. Monitoring helps ensure you are alerted to performance degradation or potential security threats, such as unauthorized access attempts.

Kafka Access Control Lists (ACLs)

Kafka ACLs are vital for managing who can produce and consume messages in your Kafka topics. Here’s how you can strengthen security at the Kafka level:

  1. Granular Access Control:
    Define granular permissions using prefix-based ACLs to restrict access to specific topics. This minimizes the risk of unauthorized data access. For example, an ACL might grant read-only access to topics that begin with lok.mysql, ensuring only consumers with proper credentials can access these topics.
  2. Client Authentication:
    Ensure that SCRAM or Mutual TLS (mTLS) is enabled for all clients interacting with your Kafka cluster. This ensures that both the client and the server mutually authenticate each other, preventing unauthorized clients from accessing your topics.
  3. Periodic Audit:
    Regularly audit your Kafka ACL configurations to ensure that only necessary permissions are granted. Over time, permissions can accumulate, and unused ACLs may become a security risk if not revoked.

Performance Tuning Notes for Kafka and Debezium

Connector Task Scaling:
The tasksMax configuration in Kafka Connect allows you to parallelize the workload by running multiple tasks for a single connector. For high-throughput environments, increase the number of tasks to distribute the load across the workers in the Connect cluster.

Note: because the MySQL connector always uses a single task, increasing this value has no effect, but check the documentation for connectors for other database engines.

Tuning Kafka Partitions:
For large datasets or when dealing with many concurrent producers and consumers, adjust the number of partitions for your Kafka topics. More partitions allow Kafka to process data in parallel, improving throughput and reducing lag in high-traffic scenarios.
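If you manage topics with the Kafka Terraform provider mentioned earlier, the partition count can be declared alongside the rest of the infrastructure. The sketch below assumes the Mongey/kafka provider, with a hypothetical topic name and sizing.

```hcl
resource "kafka_topic" "translations_cdc" {
  name               = "lok.mysql.lokalise.translations" # hypothetical CDC topic
  partitions         = 12 # sized for parallel consumers; tune to your throughput
  replication_factor = 3

  config = {
    "retention.ms"   = "604800000" # keep 7 days of events
    "cleanup.policy" = "delete"
  }
}
```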

Connector Offsets and Retention Policies:
To ensure fault tolerance and data consistency, Kafka stores offsets for each connector. You can fine-tune the offset retention period by adjusting the Kafka retention policy, ensuring that connector states are maintained even in the event of a failure.
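In Strimzi, these internal topics are set through the worker configuration on the KafkaConnect resource. The topic names below are a sketch using standard Kafka Connect worker properties.

```yaml
# Fragment of the KafkaConnect spec: internal state topics for the workers.
spec:
  config:
    offset.storage.topic: debezium-connect-offsets
    config.storage.topic: debezium-connect-configs
    status.storage.topic: debezium-connect-status
    # Replicate connector state so it survives broker failures.
    offset.storage.replication.factor: 3
    config.storage.replication.factor: 3
    status.storage.replication.factor: 3
```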

Resource Allocation for KafkaConnect:
When running KafkaConnect on Kubernetes, allocate sufficient CPU and memory resources to avoid bottlenecks. Monitor your pod performance with tools like Prometheus and set up autoscaling for Kubernetes pods if necessary.
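Strimzi exposes standard Kubernetes requests and limits on the KafkaConnect spec; the values below are illustrative starting points rather than recommendations.

```yaml
# Fragment of the KafkaConnect spec: resource allocation for the worker pods.
spec:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      memory: 2Gi # example values; tune based on observed pod metrics
```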
