Building Scalable Data Pipelines Using Secor and Presto — Part 1

Building Scalable Data Pipelines Using Secor and Presto is a two-part series detailing our experience adopting Secor and Presto for our latest data engineering project. This series covers several topics including S3, Kafka, Mesos, Hive, and Parquet.


Here at Chartbeat, the Platform team has been wrapping up work on a neat little data engineering project. Using Pinterest’s open-source Kafka log persistence service Secor and Facebook’s SQL query engine Presto, we built a data pipeline that ingests 40k messages per second with a SQL interface for interactive querying, all on top of our recently-adopted Mesos/Aurora architecture. This work will serve as the data source for several of our products whose upstream data flows from S3, positioning us to leverage more cost-effective database technologies. Along the way, we gleaned some important life lessons that should be relatable to those who’ve had the pleasure of somberly contemplating their past architectural decisions. The story that follows is one of hardship, heroism, and kittens.

Setting the Scene

Back in early 2016, a group of Chartbeat engineers set their skillful hands to the task of building a data pipeline that would power some cool new products. At the start of this pipeline, there’s a service named Free Kitten which consists of two programs. The first is a Kafka consumer that ingests messages from one of our Kafka topics and writes LZO-compressed CSVs to disk. The second is a python spool uploader that runs on a five-minute cron, uploading these CSVs to S3. Worker services import the S3-stored CSVs into AWS Redshift on an hourly basis, where product-specific ETL workers and data scientists perform their data analysis tasks.

Free Kitten data pipeline

This went mostly according to plan; pipelines were built, data scientists did whatever data scientists do, and products were launched. Mission accomplished! Or so we thought…

While building this data pipeline, plans for migrating our services to Mesos/Aurora were nothing but faint rumors, discussed in hushed tones during late hours with beers in hand. Fast forward to present day, and most of our services are running smoothly on the new Mesos/Aurora architecture, with all meaningful metrics (cpu/memory utilization, server costs, engineer quality-of-life) improving substantially as a result. While many of our services made the cutover without too much trouble, there was one service that would not go quietly into the Mesos night.

Free. Kitten.

There’s No Such Thing as a Free Kitten

To understand what prevents Free Kitten from running on Mesos, here’s a quick primer on how the service works. Kafka messages read by the Free Kitten consumer have their offsets committed after they are written to disk, before they are uploaded to S3. This works well enough since Free Kitten services live on their own Amazon EC2 instances. If an instance of Free Kitten dies, it could start back up to find its CSVs still on disk and proceed to upload them to S3 without issue.

But if this service were to run on Mesos and an instance was killed for whatever reason, there would be zero guarantee that it would restart on the same box. Those messages that were still on disk waiting to be uploaded to S3 would be lost. To make this service resilient to outages, both writing to disk and uploading to S3 would have to be managed by the same program. Offsets would have to be committed after messages were successfully uploaded to S3. Free Kitten was due for a complete makeover.

Easy Refactor = Famous Last Words

If all of this refactor talk sounds like a comical amount of work, then we are in complete agreement. Refactor tasks of this complexity are ripe with risks. Even if we pulled it off without so much as a syntax error, all of that time spent building, testing, and migrating Free Kitten 2.0 would still end up being a considerable opportunity costs. With these potential roadblocks weighing heavy on our hearts, we took a hard, sober look at our available options:

  1. Refactor Free Kitten to be fault-tolerant and Mesos-ready
  2. Exile Free Kitten to life outside of the Mesos realm
  3. Explore off-the-shelf open-source solutions

No one on the Platform team was particularly excited at the prospect of refactoring such a critical and sensitive part of our infrastructure. Leaving Free Kitten outside of Mesos didn’t sit right with us either. Few good things could come of maintaining two architectures indefinitely. Before bidding farewell to our loved ones and setting forth on quest to refactor Free Kitten, we decided to check if any open-source solutions could get us out of this mess. Our search began with the following criteria:

  • Fault-tolerant: Should a service face an untimely demise, whether at the hands of Mesos or some other invisible force, not losing any of our precious data would be nice.
  • Horizontally scalable: In case of performance issues, we would like to just throw more services at the problem without a fuss.
  • Fits our tech stack: Because trying to convince my fellow engineers that we should adopt some cool Ruby project will cause a lot of laughs, all at my expense.

With these features in mind, some internet sleuthing revealed a promising contender.

Secor - It Just Works

Secor is an open-source java project developed by Pinterest that consumes Kafka messages and saves them to your favorite cloud storage platform, ensuring that writes occur exactly once. Its horizontally scalable, supports several popular output file formats, handles partitioning for import into Hive (more on Hive in pt. 2), and includes a backup raw message writer component in case you muck up your message parsing.

Secor addresses the main shortcoming in Free Kitten’s design with refreshing simplicity. The consumer handles both message writing to disk and the uploading to S3 simultaneously. Offsets are committed only after their respective messages are uploaded to S3. Written files are set to delete upon program exit, preventing loose files from hanging around on disk should an instance of Secor die and restart elsewhere on our Mesos cluster. So far, so good.

Data Partitioning Made Easy

Out of Secor’s many great features, my favorite has to be its Hive-friendly message writing. Here’s an example of how the two consumers write messages read from Kafka to S3:

Free Kitten and Secor message writing example

A few things worth noting. Instead of writing aKafka message based on the wall clock at the time it was received (I’m looking at you, Free Kitten), Secor examines the message timestamp field of your choice and writes to the respective date-time partition path. Secor’s style of writing file paths based on date-time — for example “/dt=2017–08–01/hr=04”, makes the next step of setting up Hive partitions pain-free, which will be covered more in-depth in Part 2.

Notice in our example how the message’s timestamp is a few minutes before the wall clock of when the messages are received by the consumers. Due to the unreliable nature of the Internet, this scenario of messages trickling well after their timestamp occurs frequently. For downstream services ingesting this data, messages being written to their appropriate date-time locations allow for straightforward import logic. The fact that Secor accomplishes this seamlessly, with almost no lift on the developer’s end, is pretty sweet.

With these awesome features that Secor provides, we were able to write a custom JSON message writer, deploy it on our Mesos cluster, and tweak its performance to get it reading 40k messages per second without a hitch.

Secor data pipeline

To those reading who happen to be in the market for a Kafka log persistence service that just works, consider this a giant, ringing endorsement for Secor. Huge shout to the devs over at Pinterest for making this little gem open-source. If I wore a hat, I’d tip it in their general direction.

Lessons Learned

No journey would be complete without important life lessons to share. While the Secor portion of this story only marks the first half of this data engineering odyssey, here are some insights that we gathered from our adventure thus far:

Ask tough questions (the earlier, the better)

Even for young, hip startups that like to move fast, seriously asking yourself and your team “Will this service’s design stand the test of time?” can make future-you much less miserable. Ensure that future-you ends up appreciating the foresight and diligence of present-you.

The World Wide Web (use it)

Before you embark on a fools errand to reinvent the wheel, try a quick internet search for “wheel”. Unless you think you can do better (you can’t), rejoice at the realization of how much time you just saved yourself. Don’t be a hero; be like Newton and go stand on some giant’s shoulders.

Pay it forward

Even though your next open source project probably won’t set hacker news on fire, it might just serve a niche other engineers are desperately looking for. If you find something you’ve built to be genuinely useful, someone else probably will too.

Parting Thoughts

Using Secor to reliably persist Kafka messages to S3 has turned out very well for us. The small leap of faith we took with a lesser-known open-source project paid off tremendously, freeing us up to invest time in experimenting with several query engines that we could potentially use to read our brand-new Secor data.

What’s to Come

For part two, we’ll expand on our adoption of Facebook’s open source project Presto. We’ll cover what Presto does well, our experience getting it up and running, and how we’re using this SQL query engine to side-step some costly database dependencies by reading straight from S3.