You ‘fraid of MySQL to Redshift? Euphrates!

Quick, painless data transfers into Redshift

Joshua ♥ Hull
art/work - behind the scenes at Patreon
5 min read · Mar 14, 2017


Overview

Redshift is an amazing database for performing complex queries on massive sets of data, and here at Patreon, we love Redshift! Our main application runs off of a MySQL instance in RDS (Relational Database Service), so before we can analyze that data, we have to get it into our Redshift instance. However, if you’ve ever tried to transfer data between MySQL RDS and Redshift, perhaps you’ve experienced the overwhelming struggle of both selecting a tool and getting it working in production. Though it would be lovely to follow these instructions, they won’t work: Amazon’s restrictions mean you can’t use SELECT INTO OUTFILE on MySQL RDS.

I was struggling to find a solution, so I set out on a journey of trial, error, and my eventual triumph. After trying out a few different tools, which I talk about below, I finally settled on writing a new tool we’re calling Euphrates. Using it, we were able to transfer in minutes what had been taking us hours. At the end, I’ll discuss some of our future hopes and dreams for this tool, and invite you to join my journey.

What we tried

Here are some of the tools I tried using before finally taking the plunge and writing something specific to this problem. My main takeaways after trying these different solutions are:

- Transferring large amounts of data was unreliable

- You must either send your data through a third party or spend thousands of dollars to keep it on-premises

- Running a full transfer and validating the results takes hours

- All of these solutions make debugging incredibly hard

DMS

This is Amazon’s “in-house” offering. DMS stands for Database Migration Service, but as far as I can tell, it is Attunity Replicate repackaged by Amazon. I spent weeks trying to get this to work: it worked great for small numbers of tables, but when I attempted to transfer a larger subset it would silently fail near the end of the transfer. On top of that, a full transfer of our analytics dataset took around eight hours, which made validation tedious.

Because the tool is closed source and the host running the migration service can’t be accessed directly, debugging is incredibly difficult. You ultimately have to settle for either reading Attunity’s support boards or filing a ticket with Amazon, and there, YMMV.

AWS Data Pipeline

I have to include this option even though it wasn’t an option for us due to its unavailability in our region. It might be excellent; I wish I had experience running it. If you have any feedback about this tool, I’d love to hear it!

Matillion ETL

This tool appears to be quite costly to run and seems primarily suited to non-coders, offering a web-based GUI for defining data pipelines. I wasn’t keen to operationalize this option, as it wouldn’t play well with our configuration management system.

Other third-party services

Many other third-party services were available for this as well, but we didn’t want to ship our data off-premises if we could avoid it.

Building Euphrates

After trying out all of these options, a full replica of our MySQL data in Redshift seemed just as far off as ever. With everything I had learned to this point, I thought I’d try writing a little prototype of a MySQL-to-Redshift transfer system. I knew anecdotally that mysqldump was considered the fastest way to get data out of MySQL, but I couldn’t find a suitable parser for its default output. That’s when I stumbled across the --xml flag for mysqldump, and with that, I started prototyping a parser and importer.
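
To make that concrete, here is a rough sketch in Python of streaming rows out of a mysqldump --xml dump without holding the whole file in memory. This is purely for illustration and is not Euphrates’ actual code; the helper name and connection details are made up.

import subprocess
import xml.etree.ElementTree as ET

def stream_rows(host, user, password, database, table):
    """Yield one dict per <row> element in a `mysqldump --xml` dump."""
    dump = subprocess.Popen(
        ["mysqldump", "--xml", "--single-transaction",
         "--host", host, "--user", user, f"--password={password}",
         database, table],
        stdout=subprocess.PIPE,
    )
    # iterparse consumes the dump as a stream, so a huge table never
    # has to fit in memory all at once (simplified; a real importer
    # would also prune the tree as it goes and wait on the subprocess).
    for _, elem in ET.iterparse(dump.stdout, events=("end",)):
        if elem.tag == "row":
            # each <row> holds <field name="..."> children with the column values
            yield {field.get("name"): field.text for field in elem.iter("field")}
            elem.clear()

# Hypothetical usage:
# for row in stream_rows("my-rds-host", "reader", "secret", "app_db", "users"):
#     print(row)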

XML dumps not as advertised

When I first tried to parse and transfer the XML dump files, I hit parsing errors on characters such as 0x03 that appeared in the dump. It turned out mysqldump will include non-printable characters below 0x20 if they are in the original data. While these characters are valid in XML 1.1, they are not valid in the XML 1.0 files mysqldump produces. After trying a variety of workarounds (such as bumping the declared XML version to 1.1), it ultimately proved simpler to filter out these non-printable characters and parse the result; since excising them has no impact on our analysis, dropping them costs us nothing.
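
As a rough illustration of that filtering step (a sketch under my own naming, not the exact code in Euphrates), stripping the control characters that XML 1.0 forbids can be as simple as:

import re

# XML 1.0 permits tab (0x09), newline (0x0A), and carriage return (0x0D),
# but no other control characters below 0x20.
INVALID_XML_1_0 = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def strip_invalid_xml_chars(text):
    """Remove control characters that a strict XML 1.0 parser will reject."""
    return INVALID_XML_1_0.sub("", text)

# strip_invalid_xml_chars("bad\x03value") -> "badvalue"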

Finally, success!

Once I could completely parse the XML dumps, I started migrating large tables. By tuning our heap and introducing manifest files, I doubled our import speed.
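
For context, a Redshift manifest is just a small JSON file listing the S3 objects a single COPY should load. Generating one might look roughly like this; the bucket and key names are invented, and this isn’t Euphrates’ actual code.

import json

def write_manifest(s3_keys, bucket, path):
    """Write a COPY manifest listing every staged S3 object for one batch."""
    manifest = {
        "entries": [
            {"url": f"s3://{bucket}/{key}", "mandatory": True}
            for key in s3_keys
        ]
    }
    with open(path, "w") as f:
        json.dump(manifest, f)

# write_manifest(["users/part-0000.gz", "users/part-0001.gz"],
#                "my-transfer-bucket", "users.manifest")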

Ultimately, our bottleneck in Redshift was the speed of running COPY commands. After a few trial runs, it appeared that maximizing the amount of data referenced by a single manifest file increased throughput. Running on a c4.2xlarge with a 12GB heap, our 100GB dataset takes ~45 minutes to copy, which means we can afford to run several transfers a day.
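
The COPY itself then points at the manifest rather than at individual files, which is what lets a single command pull in as much data as possible. A minimal sketch via psycopg2 follows; the table, bucket, IAM role, and file format here are placeholders, not necessarily what Euphrates uses.

import psycopg2

# Assumes the staged files are gzipped and pipe-delimited; the real
# format, role, and bucket names will differ.
COPY_SQL = """
    COPY analytics.users
    FROM 's3://my-transfer-bucket/users.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    MANIFEST
    GZIP
    DELIMITER '|'
"""

def run_copy(dsn):
    """Load everything referenced by one manifest in a single COPY."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(COPY_SQL)

# run_copy("host=my-cluster.redshift.amazonaws.com port=5439 "
#          "dbname=analytics user=loader password=secret")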

This kind of work wouldn’t have been possible without an awesome team. Here at Patreon, we’re all pretty supportive of each other’s wild and crazy dreams (except Zach’s dream to cosplay as the Mountain Goats each and every day), and the people around me have been hella helpful in getting this prototype off the ground. Manuel Andere, one of our amazing T-shaped data science folks, was especially instrumental in getting this into production in a remarkably short period of time.

The Future

That’s not to say we’re done. This primitive solution is valuable to us, but we want to parse the binlog and do live replication, just like many other tools offer. That presents some challenges specific to Redshift. Even in the dumping case, Euphrates could do a better job of picking the order in which to dump tables.

Feedback & links

Please check out Euphrates. I’ve included some notes on setup and operation, and will update them as I continue to learn more. I’m super happy to look at any feedback, criticism, corrections, or other suggestions.
