Lean DevOps: Data Pipelines with AWS Firehose

80% immediately can be better than 100% two (or 10) weeks hence

Marcin Tustin
Build and Learn
7 min read · Apr 10, 2017


Why you care

You want to ingest a stream of data into something, and you want to be able to normalise it, enrich it, and send it somewhere without having to manage infrastructure, think about retries, or do much to figure out monitoring.

Maybe you want to replace a batch system in whole or part with stream processing. Most of all, you don’t want to burn a ton of time on operations. You should be cool with not needing the sophisticated primitives of a more infrastructure-heavy system.

Why Firehose + Lambda?

It works pretty well, you can get set up with it almost instantaneously, and it’s not especially slow. Its composition story is OK. It’s really, really cheap.

The more you wish you didn’t have to monitor the liveness of processes or containers, or figure out how to dynamically provision more processing or delivery capacity, the more you should consider whether some kind of setup like this makes sense.

Comparison with Kafka

You can do much of what you would do with Kafka, plus a cluster of Kafka Streams applications, plus Kafka Connect. Probably the latency won’t be as low; probably the hassle of managing a complex topology would be lower with a Kafka setup. You won’t get easy partial rewind capability with AWS, and you’ll need to build any tables functionality (what Kafka Streams calls KTables) yourself.

The flip side is that when you have it running reliably, you won’t be subject to any of Kafka’s failure modes, like disks filling up, split quorum, or random offset rewinds.

Pick the solution that works best for you.

Comparison with Flink, Spark, and the like

This setup can also be a suitable replacement for a Flink or Spark Streaming application running on a cluster if you’re only doing simple things. Map (any per-item operation) and filter are naturally expressed in this setup. Reduce and window operations will require you either to implement them yourself or to rely on some external system that can provide that functionality. Some reduce operations are extremely simple to implement, such as counting or any per-key operation. If windowing can be accomplished by mapping (perhaps with the assistance of a look-aside system like Redis), then key-based processing becomes much easier; a sketch of that idea follows.
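
To make the look-aside idea concrete, here is a minimal sketch of a per-key count kept in Redis so that a stateless per-item Lambda can approximate a keyed reduce. The Redis endpoint, key layout, and field names are illustrative, not part of any real setup.

```python
# Minimal sketch: a per-key count maintained in Redis as a look-aside store,
# so a stateless per-item function can approximate a keyed reduce/window.
# The Redis host, key scheme, and 'user_id' field are illustrative.
import json
import redis

r = redis.Redis(host="my-lookaside.example.internal", port=6379)

def count_by_key(records, window="2017-04-10T12:00"):
    """Increment a counter per (window, key) for each JSON record."""
    for rec in records:
        doc = json.loads(rec)
        key = f"counts:{window}:{doc['user_id']}"
        r.incr(key)                  # the "reduce" lives in Redis, not in the Lambda
        r.expire(key, 24 * 60 * 60)  # let old windows age out
```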

It’s about simplicity in infrastructure and getting things done right now if you can bear the limitations of both the infrastructure, and largely having to roll your own code. The more you wish you had rich primitives, the more it’s a sign that you’ve outgrown this setup.

Simple Topologies

Two practical examples follow, each using a single Firehose delivery stream and at most two Lambdas.

Practical example: Webhook JSON data into Redshift with no code at all

The topology is simple: API Gateway accepts the webhook posts and hands them to a Firehose delivery stream, which loads them into Redshift. I’m not going to show any code because there isn’t any real code beyond the CloudFormation you could use, and the complexity, such as it is, lives in the API Gateway configuration, which I also don’t want to get into in this post because it has its own complexity. AWS has a similar blog post: https://aws.amazon.com/blogs/aws/amazon-kinesis-setting-up-a-streaming-data-pipeline/

This assumes that you have JSON data, or that you can JSONify it in API Gateway (which has a Turing-complete language for manipulating the data posted to you; not that I’d recommend writing anything too complex in it), and that the data is fairly regular. The next recipe focuses on making regular data out of irregular data. You’ll also need to configure Redshift to know how to map your JSON to the receiving table, typically via the COPY options the delivery stream uses; a sketch of one way to do that follows.
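
One common way to express that mapping is a JSONPaths file that the Redshift COPY (issued by Firehose) references. A minimal sketch, with the bucket, key, and column paths all illustrative, written in Python so it can live alongside your other setup scripts:

```python
# Minimal sketch: upload a JSONPaths file that tells Redshift's COPY how to map
# incoming JSON fields to table columns. Bucket, key, and paths are illustrative.
import json
import boto3

jsonpaths = {
    "jsonpaths": [
        "$.user_id",
        "$.event",
        "$.occurred_at",
    ]
}

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-firehose-config",
    Key="redshift/events.jsonpaths",
    Body=json.dumps(jsonpaths).encode("utf-8"),
)
# The delivery stream's Redshift copy options would then point at
# s3://my-firehose-config/redshift/events.jsonpaths
```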

The components of a simple delivery pipeline

API Gateway can be configured to send payloads into Firehose; it knows about many AWS services, so you don’t need to figure out the integration yourself. Amazon also has a client library for Firehose, so if you can embed access in an app you may not even need API Gateway at all; a sketch of writing directly to the stream follows.
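
Here is a minimal sketch of writing straight into a delivery stream with boto3. The stream name and payload are illustrative; the put_record call is the standard Firehose client API.

```python
# Minimal sketch: write a record directly into a Firehose delivery stream from
# an application, bypassing API Gateway. Stream name and event are illustrative.
import json
import boto3

firehose = boto3.client("firehose")

def send_event(event):
    firehose.put_record(
        DeliveryStreamName="webhook-events",
        # Firehose just sees bytes; a trailing newline keeps records separable
        # once they land in S3 or in the Redshift staging area.
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_event({"user_id": 123, "event": "signup", "occurred_at": "2017-04-10T12:00:00Z"})
```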

Practical example: Grody, weird CSV to structured data in Elasticsearch

In this setup, we’re taking CSVs that represent scraped data from websites and transforming each line into a document that can be efficiently searched in Elasticsearch. The CSVs are semi-structured data: they have different sets of columns, in different orders (fortunately for me, in this scenario they were all headered CSVs), and a lot of the data is free text in varying formats, but contains information that boils down to something that can be represented in a structured way.

In this scenario, you have Lambda (1), which receives the notification that a file has been written and breaks it up into the individual documents that the Firehose delivery stream can process (Elasticsearch takes individual documents, and the delivery stream cannot break them up itself). In my setup this Lambda also does all the enrichment, regularisation, and quality enhancement that can be performed without calling any other system. The advantage of doing it at this stage (especially if you do not discard any information, which I strongly recommend) is that all further processing can rely on your data conforming to a defined schema. I use Avro to define the schema, and use automated testing offline to ensure that all results conform to the desired schema. A sketch of such a trigger Lambda follows.
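
A minimal sketch of Lambda (1): triggered by an S3 put notification, it reads the CSV, turns each row into a JSON document (with provenance attached), and feeds the delivery stream. The stream name and field handling are illustrative; the S3 event shape and the 500-record put_record_batch limit are standard.

```python
# Minimal sketch of "Lambda (1)": S3 put notification in, one Firehose record
# per CSV row out. Stream name, fields, and enrichment are illustrative.
import csv
import io
import json
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")
STREAM = "scraped-docs"

def handler(event, context):
    for notification in event["Records"]:
        bucket = notification["s3"]["bucket"]["name"]
        key = notification["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        rows = csv.DictReader(io.StringIO(body))  # headered CSV, per the scenario
        batch = []
        for row in rows:
            doc = dict(row, source_bucket=bucket, source_key=key)  # provenance
            batch.append({"Data": (json.dumps(doc) + "\n").encode("utf-8")})
            if len(batch) == 500:                 # put_record_batch limit
                firehose.put_record_batch(DeliveryStreamName=STREAM, Records=batch)
                batch = []
        if batch:
            firehose.put_record_batch(DeliveryStreamName=STREAM, Records=batch)
```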

The other advantage of doing the transformation up front is that you can attach as much provenance data as you have at that point, which will ease tracking the data back to the source as much as possible.

The delivery stream just coordinates batching, calling Lambda (2), and delivery into Elasticsearch. Lambda (2) does the work of calling Google (with any caches you may interpose) and enriching the data with the results. In our setup it also does some rate limiting to avoid overwhelming either the Google API or Elasticsearch. A sketch of the transformation Lambda follows.
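
A minimal sketch of Lambda (2). The Firehose transformation contract is standard: a batch of base64-encoded records in, and for each recordId a result of Ok, Dropped, or ProcessingFailed plus re-encoded data out. The enrichment itself is a stub standing in for the external lookup.

```python
# Minimal sketch of "Lambda (2)": a Firehose data-transformation Lambda.
# The enrich() function is a stub; the record contract is the standard one.
import base64
import json

def enrich(doc):
    """Stand-in for the external lookup (e.g. a cached geocoding call)."""
    doc["enriched"] = True
    return doc

def handler(event, context):
    out = []
    for record in event["records"]:
        try:
            doc = json.loads(base64.b64decode(record["data"]))
            payload = json.dumps(enrich(doc)) + "\n"
            out.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
            })
        except Exception:
            # Records marked ProcessingFailed end up in the failure destination.
            out.append({"recordId": record["recordId"],
                        "result": "ProcessingFailed",
                        "data": record["data"]})
    return {"records": out}
```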

Building the exact Elasticsearch application you need may require you to set up custom index templates in Elasticsearch, and you’ll need to pick how much Elasticsearch capacity to provision in terms of number of machines. Broadly speaking, the number of processors in the Elasticsearch cluster determines how many batches can be indexed at once, and therefore how quickly you can receive data into it. There’s no reason why you couldn’t use a Redshift database as the final destination instead.
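
For the index templates, a minimal sketch follows. The endpoint, template name, and fields are illustrative, and request signing (or an IP-based access policy) for an AWS-hosted cluster is left out here.

```python
# Minimal sketch: register an index template so documents land with the
# mappings you want. Endpoint, template name, and fields are illustrative;
# authentication/signing for the cluster is omitted.
import requests

template = {
    "template": "scraped-*",
    "mappings": {
        "doc": {
            "properties": {
                "occurred_at": {"type": "date"},
                "source_key": {"type": "keyword"},
                "body_text": {"type": "text"},
            }
        }
    },
}

requests.put("https://my-es-endpoint.example.com/_template/scraped",
             json=template, timeout=10)
```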

The monitoring and operations story

This is the really good bit. The delivery stream and Lambda have integrated CloudWatch logging and metrics. CloudWatch will serve you up graphs (predefined graphs exist for Lambda and Firehose) and can trigger alerts if any part of the system fails to complete perfectly; a sketch of one such alarm follows.
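
For example, here is a minimal sketch of an alarm on the delivery stream’s delivery success metric. The metric name and namespace are the standard Firehose ones; the stream name and SNS topic ARN are illustrative.

```python
# Minimal sketch: alarm when the delivery stream stops delivering successfully.
# Stream name and SNS topic ARN are illustrative.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="scraped-docs-delivery-failing",
    Namespace="AWS/Firehose",
    MetricName="DeliveryToElasticsearch.Success",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "scraped-docs"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1.0,
    ComparisonOperator="LessThanThreshold",  # any failed delivery drops the ratio
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts"],
)
```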

Similarly, both systems have built-in retries, so you don’t need to program that logic yourself for transitory problems.

You don’t have to provision any machines yourself or manage cluster capacity.

The backpressure story

There isn’t one. Or rather, like with lots of things in AWS, and Kinesis in particular, it’s up to you to make it yourself.

The simplest approach is to throttle the rate of processing a batch in a transformation Lambda, possibly with sleeps, as sketched below. You’ll need to reduce the batch size so that the transformation Lambda isn’t killed for going over the five-minute execution limit. Obviously this is the opposite of a reactive architecture, and if you want to monitor the health of whatever system is providing backpressure, and possibly trigger some kind of extra capacity provisioning, you’ll need to figure that out yourself.
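
A minimal sketch of that crude throttling. The pacing numbers are illustrative and have to be chosen so a whole batch still finishes inside the Lambda time limit.

```python
# Minimal sketch: crude backpressure inside a transformation Lambda by sleeping
# between sub-batches. The rate is illustrative.
import time

def transform(record):
    """Stand-in for whatever per-record work the Lambda does."""
    return record

def process_with_throttle(records, per_second=50):
    out = []
    for i, record in enumerate(records):
        if i and i % per_second == 0:
            time.sleep(1)  # cap the rate at which we hit the downstream system
        out.append(transform(record))
    return out
```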

Yeah, it’s not slick.

The replay/reprocessing story

In both of the setups shown, you’ll have data in S3 objects. As long as you have a Lambda that can take data from an object in S3 and send it into Firehose after whatever processing, you can find the objects with the data you want and send them to your Lambda; a sketch follows. Again, something of a stone-soup approach, but hopefully you will very rarely intentionally want to rewind your data after you’re done with development. In addition, because you can define what is sent to your Lambda, this can be a more flexible approach than simply picking an offset in a topic (although e.g. Kafka offers keying, so with some thought you could achieve a similar effect there).
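
A minimal sketch of such a replay: find the S3 objects holding the data you want and push their contents back through Firehose. The bucket, prefix, and stream name are illustrative, and it assumes newline-delimited records, which is what Firehose delivers if you appended newlines upstream.

```python
# Minimal sketch: replay archived S3 objects back into a delivery stream.
# Bucket, prefix, and stream name are illustrative.
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

def replay(bucket="pipeline-archive", prefix="2017/04/10/", stream="scraped-docs"):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            records = [{"Data": line + b"\n"} for line in body.splitlines() if line]
            for i in range(0, len(records), 500):  # put_record_batch limit
                firehose.put_record_batch(DeliveryStreamName=stream,
                                          Records=records[i:i + 500])
```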

Robustness and fault tolerance

Firehose handles both retries of delivery and retries of Lambda invocation. Lambda has its own retry mechanism. If those fail, Firehose can be configured with a dead-letter resource (usually an S3 bucket) to receive the messages that failed to process.

Transformation Lambdas may also mark individual records as failed (as in the sketch above), and those records will also be sent to the dead-letter resource.

As for the trigger Lambdas, there are retries (as mentioned above) but no explicit dead-letter functionality. Keep their memory usage constant and tune them so they don’t fail. The more detailed the source information you carry forward, the more easily you can piece together what did and did not process completely, and where.

The deployment story

Lots of frameworks for Lambda now exist, and the good ones should be able to define the triggers you need. I’ve used Kappa (which is Python-focused), and it does most of what one wants. To deploy the rest of the infrastructure, you’ll need to orchestrate your CloudFormation however you do it now; there are a lot of tools in this area if you don’t want to go CI + scripts + awscli + raw CloudFormation. A minimal version of the latter follows.
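
A minimal sketch of driving the non-Lambda infrastructure with raw CloudFormation via boto3. The stack and template names are illustrative; a CI job can call something like this directly.

```python
# Minimal sketch: create a CloudFormation stack for the pipeline infrastructure.
# Stack name and template file are illustrative.
import boto3

cfn = boto3.client("cloudformation")

with open("pipeline.template.json") as f:
    template = f.read()

cfn.create_stack(                      # or update_stack on subsequent deploys
    StackName="firehose-pipeline",
    TemplateBody=template,
    Capabilities=["CAPABILITY_IAM"],   # needed if the template creates IAM roles
)
cfn.get_waiter("stack_create_complete").wait(StackName="firehose-pipeline")
```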

Chaining and complex topologies

By making use of triggers on S3 buckets, as in the second example, you can use S3 buckets as persistent stores for your “topics” and have one Lambda with a routing table that knows how to dispatch the data to wherever it needs to be processed next; a sketch follows. With this approach you can have topologies where data flows fork and join.
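
A minimal sketch of such a routing Lambda. The routing table, prefixes, and destinations are illustrative; the dispatch is by key prefix, either into a delivery stream or into another “topic” bucket.

```python
# Minimal sketch of a routing Lambda: triggered by several "topic" buckets,
# dispatching each new object to the next stage by key prefix.
# The routes, prefixes, and destinations are illustrative.
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

ROUTES = {
    "raw/scraped/":  {"firehose": "scraped-docs"},
    "raw/webhooks/": {"firehose": "webhook-events"},
    "joined/":       {"copy_to": ("analytics-bucket", "to-load/")},
}

def handler(event, context):
    for notification in event["Records"]:
        bucket = notification["s3"]["bucket"]["name"]
        key = notification["s3"]["object"]["key"]
        for prefix, route in ROUTES.items():
            if not key.startswith(prefix):
                continue
            if "firehose" in route:
                body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                records = [{"Data": line + b"\n"}
                           for line in body.splitlines() if line]
                for i in range(0, len(records), 500):
                    firehose.put_record_batch(DeliveryStreamName=route["firehose"],
                                              Records=records[i:i + 500])
            else:
                dest_bucket, dest_prefix = route["copy_to"]
                s3.copy_object(Bucket=dest_bucket,
                               Key=dest_prefix + key.split("/")[-1],
                               CopySource={"Bucket": bucket, "Key": key})
            break
```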

If this sounds a bit like the early days of Hadoop, where you would rely on HDFS to orchestrate workflows, it is, except that this is all managed in the cloud and you don’t need to support much of it beyond writing the code.

Data Engineering and Lean DevOps consultant — email marcin.tustin@gmail.com if you’re thinking about building data systems