Creating a no-code AWS native OLTP to OLAP data pipeline — Part 1

Leveraging Kafka Connect to Stream Data from MySQL RDS to S3 and make them available to Redshift without writing a line of code.

Haris Michailidis
Data rocks
7 min read · Dec 27, 2020


In this first part, we’ll discuss what drove us and how we executed this project at Nutmeg. Stay tuned for the second part, where we’ll dive into more details, issues and tuning.

The Drive

Moving data from the transactional (OLTP) to the analytical (OLAP) side of a company is probably one of the most common tasks of a Data Engineer, and there are numerous ways of doing it.

In our case, we already had in-house developed processes to transfer data from our MySQL databases to Redshift, which is our Data Warehouse, but most of them were batch-based with serious limitations: a large code base spread across different languages, multiple methodologies for transferring data, and batch copy jobs that couldn’t account for backdated changes. As Jeff Magnusson put it in his post Engineers Shouldn’t Write ETL, “engineers must deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy”. We decided to embrace this idea and find a better solution.

Some of the thoughts we had while designing and building the new Pipeline were:

  • Transfer data reliably and propagate all changes from the upstream systems
  • Make the solution flexible enough to be tuned per group of source tables (usually meaning being able to trade off latency for throughput)
  • Make it easily maintainable and replicable
  • Allow engineering teams to self-serve and deliver data to the OLAP side by themselves
  • Use existing technology and out-of-the-box tools as much as possible
  • Develop or write as little code as possible
  • Leverage Kafka, since it was already part of the streaming stack

The Architecture

After defining the goal and the parameters of the project, it was time to design it.

We were already publishing all database changes to Kafka using Kafka Connect, so it felt natural to continue the architecture with another Connector (the S3 Sink Connector) to send the data to S3, ready to be consumed by Redshift Spectrum.

As seen below, the Architecture is pretty straightforward, but there were a few key elements we had to iron out to run it successfully and reliably in production.

The Architecture

The Elements

Debezium CDC Source Connector

Change Data Capture (CDC) is the process of capturing the changes made to a datastore and sending them downstream for further use. To capture changes from our source tables, we used the Debezium Kafka Source Connector, which is responsible for sending all the changes from the MySQL DBs to Kafka. You can read more about it in the Debezium Connector for MySQL documentation.

We were already using this Connector to stream data to Kafka, and stream-processing jobs relied on it, so this part remained unchanged.
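
For illustration, here’s a minimal sketch of what a Debezium MySQL source connector configuration can look like. The hostnames, database and table names below are placeholders rather than our actual setup, and some property names vary slightly between Debezium versions:

    name=mysql-source-shop
    connector.class=io.debezium.connector.mysql.MySqlConnector
    tasks.max=1
    database.hostname=mysql-rds.example.internal
    database.port=3306
    database.user=debezium
    database.password=<secret>
    # Unique numeric ID used when the connector joins MySQL replication
    database.server.id=184054
    # Logical name, used as the prefix of the Kafka topic names
    database.server.name=shopdb
    table.include.list=shop.orders,shop.customers
    # Kafka topic where the connector records the source DB's schema changes
    database.history.kafka.bootstrap.servers=kafka:9092
    database.history.kafka.topic=schema-changes.shop

With a setup like this, every insert, update and delete on the included tables lands as a message on a topic such as shopdb.shop.orders.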

S3 Sink Connector

With the messages in Kafka, we wanted to propagate them to S3. This part was very interesting, as the Connector gives you quite a few options to play with; it’s worth taking a closer look at the S3 Sink Connector documentation by Confluent.

Below you can find some of the options that mattered most to us. We won’t provide exact values for all of them because, depending on the shape and type of the source data, these settings were tuned per group of topics, which we’ll discuss later.

All the available configuration options can be found here: https://docs.confluent.io/kafka-connect-s3-sink/current/configuration_options.html

Object Formatting

format.class=io.confluent.connect.s3.format.parquet.ParquetFormat : Out of all the available S3 object formats, Parquet made sense for us from many perspectives. First of all, it’s a columnar storage format that matches Redshift Spectrum’s query engine perfectly; secondly, it supports Google’s Snappy compression, which Spectrum can read directly, minimising the network transfer size; finally, it was the format most easily understood by the AWS Glue Crawler.
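
As a rough sketch (the bucket, region and topic names are placeholders, not our actual setup), the corresponding part of an S3 Sink Connector configuration looks like this, with parquet.codec selecting the compression codec:

    name=s3-sink-shop
    connector.class=io.confluent.connect.s3.S3SinkConnector
    storage.class=io.confluent.connect.s3.storage.S3Storage
    s3.bucket.name=example-data-lake
    s3.region=eu-west-1
    topics.regex=shopdb\.shop\..*
    # Write Snappy-compressed Parquet objects
    format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
    parquet.codec=snappy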

Object Uploading

Object uploading options play a very important role: they set your SLAs and define the throughput, lag and memory consumption of your pipeline.

  • flush.size: The maximum number of records written per S3 object. By setting this you can cap the size of your S3 objects, so you don’t end up with very large files in S3 that can potentially be slow to query (you’ll see how this helped us in Part 2).
  • rotate.schedule.interval.ms: This property sets the maximum time a file can remain open before it’s flushed to S3, which gives us an approximate maximum lag between a message being read from Kafka and being delivered to S3. Watch out for the difference with rotate.interval.ms, which only closes a file when a record arrives that belongs to the next time segment, thus potentially keeping a file open indefinitely. A sketch of these settings follows the documentation link below.

https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#s3-object-uploads
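
As an example only (we tuned the actual numbers per group of topics, so these values are illustrative rather than our production settings):

    # Close an object after at most 100,000 records...
    flush.size=100000
    # ...or after 10 minutes, whichever comes first
    rotate.schedule.interval.ms=600000
    # Wall-clock based rotation requires an explicit timezone
    timezone=UTC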

  • partitioner.class: When choosing the partitioner, we faced two cases: when we didn’t know the access patterns upfront, and when we did. The first was when the source table had no particular expected read access pattern, so the DefaultPartitioner fit well enough to accommodate any read queries. The second was for tables with columns that were almost always used in the WHERE clause when querying the data, either a date field or some low-cardinality filtering columns. In these cases, we used the FieldPartitioner to create a directory-like structure in S3, which massively benefits the querying side (see the Redshift Spectrum section). An additional feature of the FieldPartitioner is that you can specify multiple fields and use them in a tree-like structure, as sketched after the documentation link below.

https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#partitioning-records-into-s3-objects
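
A sketch of both options; the field names in the FieldPartitioner example are hypothetical, chosen to mirror columns that would typically appear in WHERE clauses:

    # Case 1: no particular read pattern, keep the default per-Kafka-partition layout
    # partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner

    # Case 2: partition by the columns most queries filter on, producing
    # S3 prefixes such as .../country=UK/trade_date=2020-12-01/
    partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
    partition.field.name=country,trade_date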

Connect Transforms

Single Message Transforms (SMTs) are a Kafka Connect concept that we leveraged to convert the data from their original schema into a more query-friendly one, as you’ll see in Part 2. They are handy when you want to transform records in flight while sending them to/from Kafka, and you can even develop your own.
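
As an illustration (not necessarily the exact transforms we used, which we’ll cover in Part 2), Debezium ships the ExtractNewRecordState SMT, which flattens the verbose before/after change envelope into a plain row; exact option names can differ between Debezium versions:

    transforms=unwrap
    transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
    # Keep the tombstone records Debezium emits for deletes
    transforms.unwrap.drop.tombstones=false
    # Expose the operation type and source timestamp as extra fields
    transforms.unwrap.add.fields=op,source.ts_ms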

S3 & AWS Glue

S3 is pretty self-explanatory and needed no specific settings, so let’s look at AWS Glue. We used Glue as our Data Catalogue to map all the data in the S3 bucket and make them easily accessible to Redshift Spectrum (see next section). A Glue Crawler is scheduled to run periodically and scan the S3 bucket to discover new tables and partitions. If you store all your files under the same set of partitions, you only need to run the crawler once; in our case, however, there were tables where new (date-based) partitions were added daily. For Spectrum to run queries efficiently and only scan the files for a given day, the Crawler needs to run and update the table in the Glue Catalogue.

Redshift Spectrum

With Redshift Spectrum, AWS gives you the ability to use the computational power of your Redshift cluster to query data directly in S3.

A useful guide to get started is the following; skip the part about creating an external table manually and use the Glue Crawler instead, which will make your life much easier.

https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html

Note 1: By leveraging the partitioning mechanism discussed previously, you’ll be able to increase performance and limit cost by only scanning a specific prefix, based on the partition you’re interested in.

Note 2: Since the data in S3 contain all the raw entries from MySQL, you’ll need to do a bit of work at query time (or at rest, or both) to get the “current view” of your information. We’ll describe how to solve this in depth in the next part of the article.

Conclusion

Building a no-code data pipeline is both an efficient and effective way of delivering value to the company while minimising time-to-production and investment. With such a solution your core task is glueing tools together, so configuration and tuning are the biggest challenges to look out for. You need to be extra careful with all the small details and caveats of each component, because they can have a significant impact on both cost and performance. Still, in terms of “development” time and maintenance effort, I believe it’s very much worth it.

That’s all for the first part of this topic, stay tuned for the second part where we’ll dive into more details, caveats, and tips on the different components of this pipeline.

Acknowledgements

Many thanks to Francesco Nobilia and Javier Holguera for helping me with the creation of this post.
