Published in theburningmonk.com

how to do fan-out and fan-in with AWS Lambda

In the last post, we looked at how you can implement pub-sub with AWS Lambda. We compared several event sources you can use (SNS, Kinesis streams and DynamoDB streams) and the tradeoffs available to you.

Let’s look at another messaging pattern today: push-pull, which is often referred to as fan-out/fan-in.

It’s really two separate patterns working in tandem.

Fan-out is often used on its own, where messages are delivered to a pool of workers in a round-robin fashion and each message is delivered to only one worker.

This is useful in at least two different ways:

  1. having a pool of workers to carry out the actual work allows for parallel processing and leads to increased throughput
  2. partitioning an expensive task (say, a batch job) into many subtasks lets different workers process them in parallel

In the second case, where the original task (say, a batch job) is partitioned into many subtasks, you’ll need fan-in to collect the results from individual workers and aggregate them together.

fan-out with SNS

As discussed above, SNS’s invocation-per-message policy is a good fit here as we’re optimizing for throughput and parallelism during the fan-out stage.

Here, a ventilator function would partition the expensive task into subtasks, and publish a message to the SNS topic for each subtask.
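To make this concrete, here is a minimal sketch of what the ventilator’s partitioning step might look like. The function name, payload shape and batching scheme are all illustrative assumptions, not from the original post; the actual publishing would go through SNS (noted in the comment) but is kept out of the function so the logic runs anywhere.

```python
import uuid

def partition_job(items, batch_size):
    """Split a job's items into subtask payloads, one per batch.

    Returns the generated job ID and a list of message payloads; each
    payload carries the job ID plus its own task ID, which the fan-in
    stage later uses to record results. (Payload shape is illustrative.)
    """
    job_id = str(uuid.uuid4())
    messages = []
    for n, start in enumerate(range(0, len(items), batch_size), start=1):
        messages.append({
            "job_id": job_id,
            "task_id": f"task_{n:02d}",
            "items": items[start:start + batch_size],
        })
    return job_id, messages

# The ventilator would then publish one SNS message per payload, e.g.:
#   sns = boto3.client("sns")
#   for msg in messages:
#       sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(msg))
```

Because SNS invokes the worker function once per message, each subtask gets its own concurrent Lambda invocation during fan-out.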

This is essentially the approach we took when we implemented the timeline feature at Yubl (the last startup I worked at), which works the same way as Twitter’s timeline: when you publish a new post it is distributed to your followers’ timelines; and when you follow another user, their posts show up in your timeline shortly after.

fan-out with SQS

Before the advent of AWS Lambda, this type of workload was often carried out with SQS. Unfortunately SQS is not one of the supported event sources for Lambda, which puts it at a massive disadvantage here.

That said, SQS itself is still a good choice for distributing tasks, and if your subtasks take longer than 5 minutes to complete (the max execution time for Lambda) you might still have to find a way to make the SQS + Lambda setup work.

Let me explain what I mean.

First, it’s possible for a Lambda function to go beyond the 5 min execution time limit by writing it as a recursive function. However, the original invocation (triggered by SNS) has to signal whether or not the SNS message was successfully processed, but that information is only available at the end of the recursion!

With SQS, you have a message handle that can be passed along during recursion. The recursed invocation can then use the handle to:

  • extend the visibility timeout for the message so another SQS poller does not receive it whilst we’re still processing the message
  • delete the message once it has been successfully processed
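A rough sketch of that recursion, under my own assumptions about the shape of the worker: the loop below models the chain of recursive invocations, and a stub stands in for boto3’s SQS client (whose real `change_message_visibility` and `delete_message` calls take exactly these `QueueUrl`/`ReceiptHandle` parameters) so the logic is runnable without AWS.

```python
class FakeSQS:
    """Stand-in for boto3's SQS client, recording calls for inspection."""
    def __init__(self):
        self.calls = []

    def change_message_visibility(self, QueueUrl, ReceiptHandle, VisibilityTimeout):
        self.calls.append(("extend", ReceiptHandle, VisibilityTimeout))

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.calls.append(("delete", ReceiptHandle))

def process_recursively(sqs, queue_url, receipt_handle, batches):
    """Each loop iteration models one recursive Lambda invocation:
    extend the message's visibility timeout (so no other poller picks
    it up mid-processing), do a slice of the work, then recurse.
    Only at the very end is the message deleted from the queue."""
    results = []
    for batch in batches:
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=300,  # illustrative extension, in seconds
        )
        results.append(sum(batch))  # placeholder for the real work
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
    return results
```

The key point is that the receipt handle travels with the recursion state, so any invocation in the chain can keep the message invisible or, finally, delete it.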

A while back, I prototyped an architecture for processing SQS messages using recursive Lambda functions. The architecture allows for elastically scaling the number of pollers up and down based on the size of the backlog (or whatever CloudWatch metric you choose to scale on).

You can read all about it here.

I don’t believe it lowers the barrier to entry for the SQS + Lambda setup enough for regular use, not to mention the additional cost of running a Lambda function 24/7 to poll SQS.

The good news is that AWS has announced that SQS is coming to Lambda as an event source! So hopefully in the future you won’t need workarounds like the one I created to use Lambda with SQS.

What about Kinesis or DynamoDB Streams?

Personally I don’t feel these are great options, because the degree of parallelism is constrained by the number of shards. Whilst you can increase the number of shards, it’s a really expensive way to get extra parallelism, especially given the way resharding works in Kinesis Streams: after splitting an existing shard, the old shard is still around for at least 24 hours (based on your retention policy) and you’ll continue to pay for it.

Therefore, dynamically adjusting the number of shards to scale the degree of parallelism up and down can incur a lot of unnecessary cost.

With DynamoDB Streams, you don’t even have the option to reshard the stream: it’s a managed stream that reshards as it sees fit.

fan-in: collecting results from workers

When the ventilator function partitions the original task into many subtasks, it can also include two identifiers with each subtask: one for the top-level job, and one for the subtask. When the subtasks are completed, you can use these identifiers to record their results.

For example, you might use a DynamoDB table to store these results. But bear in mind that DynamoDB has a max item size of 400KB, including attribute names.

Alternatively, you may also consider storing the results in S3, which has a max object size of a whopping 5TB! For example, you can store the results as the following:

bucket/job_id/task_01.json 
bucket/job_id/task_02.json
bucket/job_id/task_03.json
...
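A tiny helper makes the layout above concrete; the function name and zero-padding are my own choices, and the `put_object` call in the comment is how a worker would actually write the result with boto3.

```python
import json

def result_key(job_id, task_no):
    """Build the S3 key for a subtask's result, following the
    bucket/job_id/task_NN.json layout shown above."""
    return f"{job_id}/task_{task_no:02d}.json"

# A worker would then persist its result with something like:
#   s3 = boto3.client("s3")
#   s3.put_object(Bucket=BUCKET,
#                 Key=result_key(job_id, task_no),
#                 Body=json.dumps(result))
```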

Note that in both cases we’re prone to experience hot partitions: a large number of writes against the same DynamoDB hash key or S3 prefix.

To mitigate this negative effect, be sure to use a GUID for the job ID.

Depending on the volume of write operations you need to perform against S3, you might need to tweak the approach. For example:

  • partition the bucket with top-level folders and place results into the correct folder based on the hash value of the job ID
bucket/01/job_id_001/task_01.json
bucket/01/job_id_001/task_02.json
bucket/01/job_id_001/task_03.json
...
  • store the results in an easily hashable but unstructured way in S3, but also record references to them in a DynamoDB table
bucket/ffa7046a-105e-4a00-82e6-849cd36c303b.json 
bucket/8fb59303-d379-44b0-8df6-7a479d58e387.json
bucket/ba6d48b6-bf63-46d1-8c15-10066a1ceaee.json
...
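For the first option, the top-level folder can be derived deterministically from the job ID. A sketch of one way to do that, where the hash function and the number of folders are illustrative choices rather than anything prescribed by the post:

```python
import hashlib

def prefixed_key(job_id, task_no, folders=100):
    """Derive a top-level folder from a hash of the job ID so that
    writes for different jobs spread across S3 prefixes, then append
    the job_id/task_NN.json layout used earlier."""
    folder = int(hashlib.md5(job_id.encode()).hexdigest(), 16) % folders
    return f"{folder:02d}/{job_id}/task_{task_no:02d}.json"
```

The same job ID always hashes to the same folder, so workers and the sink function can both recompute the prefix without any coordination.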

fan-in: tracking overall progress

When the ventilator function runs and partitions the expensive task into lots of small subtasks, it should also record the total number of subtasks. This allows each invocation of the worker function to atomically decrement the count, until it reaches 0.

The invocation that sees the count reach 0 is then responsible for signalling that all the subtasks are complete. It can do this in many ways, perhaps by publishing a message to another SNS topic so the worker function is decoupled from whatever post-processing steps need to happen to aggregate the individual results.
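The decrement-until-zero step above can be sketched as follows. With DynamoDB the atomic decrement would be an `UpdateItem` with an `ADD` update expression and `ReturnValues='UPDATED_NEW'` to read back the new count; here an in-memory stand-in (my own construction, not the post’s code) models that behaviour so the fan-in logic is runnable:

```python
import threading

class FakeCounterTable:
    """In-memory stand-in for a DynamoDB table holding the per-job
    remaining-subtasks count. The real decrement would be update_item
    with UpdateExpression='ADD remaining :dec' and
    ReturnValues='UPDATED_NEW'."""
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def init_job(self, job_id, total_subtasks):
        self._counts[job_id] = total_subtasks

    def decrement(self, job_id):
        # atomic decrement; returns the new value, as DynamoDB would
        with self._lock:
            self._counts[job_id] -= 1
            return self._counts[job_id]

def on_subtask_complete(table, job_id):
    """Fan-in step run at the end of each worker invocation: decrement
    the count, and return True only for the single invocation that sees
    it hit 0 (that one then signals completion, e.g. via SNS)."""
    return table.decrement(job_id) == 0
```

Because the decrement is atomic, exactly one worker observes the zero, so the completion signal fires once no matter how the subtasks interleave.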

(wait, so are we back to the pub-sub pat­tern again?) maybe ;-)

At this point, the sink function (or reducer, as it’s called in the context of a map-reduce job) would be invoked. Seeing as you’re likely to have a large number of results to collect, it might be a good idea to write the sink function as a recursive function too.

Anyway, these are just a few of the ways I can think of to implement the push-pull pattern with AWS Lambda. Let me know in the comments if I have missed any obvious alternatives.

Like what you’re reading but want more help? I’m happy to offer my services as an independent consultant and help you with your serverless project — architecture reviews, code reviews, building proof-of-concepts, or offer advice on leading practices and tools.

I’m based in London, UK and currently the only UK-based AWS Serverless Hero. I have nearly 10 years of experience with running production workloads in AWS at scale. I operate predominantly in the UK but I’m open to travelling for engagements that are longer than a week. To see how we might be able to work together, tell me more about the problems you are trying to solve here.

I can also run in-house workshops to help you get production-ready with your serverless architecture. You can find out more about the two-day workshop here, which takes you from the basics of AWS Lambda all the way through to common operational patterns for log aggregation, distributed tracing and security best practices.

If you prefer to study at your own pace, you can also find the same content in a video course I have produced for Manning. We will cover topics including:

  • authentication & authorization with API Gateway & Cognito

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we’re in Manning’s Early Access Program (MEAP).

Yan Cui

AWS Serverless Hero. Helping companies go faster for less with serverless.