Story of serverless (Lambda) to move large data from DynamoDB


Sachin Maheshwari and Bala Dutt

The requirement

A system stores small (~1KB) records in DynamoDB at a rate of 7k records per second (rps). This data then has to be pushed to another system by making REST service calls. Delays are acceptable in this push, but loss is not. Instead of proposing a project and estimating effort, an experiment was done on a fine sunny day.

Serverless = short time to production

The decision was to go serverless, so that the focus stays on creating value quickly rather than getting bogged down in provisioning and running the system. Serverless is also excellent glue between systems, and it is a new mindset; read more here. DynamoDB Streams and Lambda were picked as the technologies. A Lambda was written to be triggered whenever the DynamoDB stream has new data. This Lambda takes a batch of events and makes a service (HTTPS) call for every event. And as mentioned, the throughput is 7k rps.
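
As a rough sketch of this first version (the endpoint URL, payload shape and timeout are placeholders, not details from the original system), the handler walks the stream records and makes one HTTPS call per record:

```python
import json
import urllib.request

# Hypothetical downstream endpoint; the real service URL is not in the post.
TARGET_URL = "https://downstream.example.com/records"

def handler(event, context):
    # A DynamoDB stream trigger hands the Lambda a batch of records.
    for record in event["Records"]:
        # NewImage is the item as written to DynamoDB, in DynamoDB JSON form.
        item = record["dynamodb"].get("NewImage", {})
        req = urllib.request.Request(
            TARGET_URL,
            data=json.dumps(item).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Each call opens and closes its own connection; a failure raises, so the
        # whole batch is retried by the stream (loss is not acceptable).
        with urllib.request.urlopen(req, timeout=5) as resp:
            resp.read()
```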

Functionally correct, economical and tuned

The steps were: make it work correctly, make it economical, and do all this in the least development time. Remember the quote from Donald Knuth: “The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” So the focus was first on making the system work functionally.

Making it work

The first question was: which language to use? Although the whole system is in Java, Python was picked. Choose the best tool for the task; don’t be afraid of polyglot. The whole code was written in the Lambda console editor, and in an hour there was a working system. No GitHub, builds, deploy pipelines, etc. Idea to working POC in one hour. Serverless rocks!

Stream data was batched so that the Lambda was called with a batch of 100 events. Processing one batch of 100 events took about 23 seconds, averaging 230ms per event, and this time was predominantly spent on the downstream calls. Making a system correct is easy; making it work at scale is where the interesting things happen. It turned out that this was costly to run (100 USD per day). It ran for a day, and the next day a huge pile-up of events was seen because the Lambda was being throttled. The cost would have shot up much more, but good engineering had put limits on everything. The pile-up of messages shows up in the Iterator Age metric, which indicates how long a message has been sitting waiting to be processed by Lambda. Scaling up to process the load would have meant a much higher cost, and that was not the economy we wanted. The next phase was to make it economical.
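
The Iterator Age pile-up can be watched in CloudWatch. A minimal sketch using boto3; the function name and region are illustrative assumptions:

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is an assumption

# IteratorAge (milliseconds) tells how old the records were when the Lambda
# received them; a growing value means the stream consumer is falling behind.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="IteratorAge",
    Dimensions=[{"Name": "FunctionName", "Value": "push-to-downstream"}],  # hypothetical name
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```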

No thread waiting

So, what is the problem? There are 100 events in a batch, and they result in 100 HTTP calls to the service. Most of the time the Lambda is just waiting, and in this new world, the money counter keeps spinning. Time is money, certainly in this situation. Could these HTTP calls be made async, with the results polled once done? An async HTTP call could not be found in the available libraries; the lesson is that Python on Lambda has its limitations. Another learning: always verify your solution against what is allowed in the public cloud. So the next idea was to do this with multithreading, which gave a 50% improvement. A view of the system at this point looks somewhat like the sketch below.
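
A minimal sketch of the multithreaded variant; the thread count and endpoint are illustrative assumptions, not the actual tuned values:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://downstream.example.com/records"  # hypothetical endpoint
THREADS = 10  # illustrative; the sweet spot has to be found by experiment

def push_one(item):
    req = urllib.request.Request(
        TARGET_URL,
        data=json.dumps(item).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

def handler(event, context):
    items = [r["dynamodb"].get("NewImage", {}) for r in event["Records"]]
    # Fan the downstream calls out over a small thread pool so the Lambda
    # overlaps the network waits instead of paying for them one by one.
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        # list() forces all futures to complete; any exception propagates and
        # fails the batch so the stream retries it.
        list(pool.map(push_one, items))
```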

Resource pooling

This still wasn’t the gold. The next phase was to manage resources optimally. Who doesn’t know pooling and caching as two of the greatest tools for handling scale? A connection is a resource, and we all know how costly it is to establish one: roughly 1.5 round trips for the TCP handshake, one round trip each for the request and the connection close, plus roughly another 1.5 for the SSL handshake. Clearly, establishing a new connection for each request is very costly. To reuse connections, connection pooling was brought in, and this was the killer improvement: a huge benefit in performance. Here was a solution that worked in the test environment with test data.
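
A minimal sketch of that change, assuming urllib3 (which ships as a boto3 dependency in the Lambda Python runtime) and the same hypothetical endpoint. The pool lives at module level, so warm invocations reuse already-established connections:

```python
import json
import urllib3
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://downstream.example.com/records"  # hypothetical endpoint
THREADS = 10  # illustrative

# Module-level pool: created once per Lambda container, so connections survive
# across invocations and the TCP/SSL handshake is not paid for every request.
http = urllib3.PoolManager(maxsize=THREADS, timeout=urllib3.Timeout(total=5.0))

def push_one(item):
    resp = http.request(
        "POST",
        TARGET_URL,
        body=json.dumps(item),
        headers={"Content-Type": "application/json"},
    )
    if resp.status >= 300:
        raise RuntimeError(f"Downstream returned {resp.status}")

def handler(event, context):
    items = [r["dynamodb"].get("NewImage", {}) for r in event["Records"]]
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        list(pool.map(push_one, items))
```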

Nothing is straight and simple

The final phase was to set the knobs to appropriate levels: the search for a sweet spot between the number of threads, the batch size and the connection pool size. The constraints are the following:

  • Lambda will time out at 900s, and you want to finish before that even in the worst case. So keep your own timeouts and fail at your will rather than having Lambda time you out; if Lambda times out, the whole batch has to be retried. A sketch of this budgeting is shown right after this list.
  • Lambda gets fixed computing power, so increasing the thread count a lot doesn’t help. Some sites mention that Amazon runs 3 Lambdas on 2-core machines, so expect roughly 2/3 of a core per Lambda. If the application is mostly in I/O wait, then multi-threading does help. The sweet spot can be found by trying out multiple thread counts.
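
For the first bullet, a minimal sketch of budgeting against the function timeout using the context object’s get_remaining_time_in_millis(); the safety margin and the process() helper are illustrative placeholders:

```python
SAFETY_MARGIN_MS = 10_000  # illustrative: stop well before Lambda's own timeout

def handler(event, context):
    for record in event["Records"]:
        # get_remaining_time_in_millis() is provided by the Lambda runtime and
        # counts down toward the configured function timeout.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Fail on our own terms: raising makes the stream retry the batch,
            # instead of Lambda killing us mid-call at the 900s limit.
            raise TimeoutError("Bailing out before the Lambda timeout")
        process(record)  # hypothetical per-record work
```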

Lambda costs are based on invocations and duration, with duration rounded up to the nearest 100ms. So if a Lambda finishes in 10ms, the charge is still for 100ms, 10x the actual time. For DynamoDB Global tables spanning two regions, streams deliver all the activity, including the sync between the regions, which produces many times more events than one would expect. It would be nice if AWS had a way to say which kinds of DynamoDB events one is interested in.
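
A small worked example of that rounding, using the 100ms granularity described here (the billing granularity may differ today):

```python
import math

def billed_ms(actual_ms, granularity_ms=100):
    # Duration is rounded up to the next billing granularity boundary.
    return max(granularity_ms, math.ceil(actual_ms / granularity_ms) * granularity_ms)

billed_ms(10)   # -> 100: a 10ms invocation is billed as 100ms, i.e. 10x the actual time
billed_ms(230)  # -> 300: rounding always goes up to the next 100ms boundary
```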

Streams are polled per DynamoDB partition. That is, two partitions are not combined into a single batch; each gets its own batch. With a large number of partitions, and writes landing on many of them every second, the stream hands very small batches to the Lambda. This polling happens 4 times a second. The concurrency count may be tuned so that the Lambda is not called for every partition change. This is where our story landed for now. The system could not be taken live because it could never get enough events into a batch and therefore could not achieve economies of scale. But the story is not over; the plan is to reduce the data by moving some of it to cold storage and then come back to this solution. Stay tuned, the climax of the story will be posted.
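
The batching knobs live on the Lambda event source mapping. A minimal sketch of adjusting them with boto3; the mapping UUID and the chosen values are placeholders, and whether such tuning is enough to solve the small-batch problem is exactly the open question above:

```python
import boto3

lambda_client = boto3.client("lambda")

# The UUID identifies the existing DynamoDB-stream-to-Lambda mapping; placeholder here.
lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",
    BatchSize=100,                     # up to this many records per invocation
    MaximumBatchingWindowInSeconds=5,  # wait up to this long to fill a batch
)
```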

Conclusion

Well, even with serverless, keep the server mindset. Here are the key learnings:

  • Serverless = short development time. Work in phases: make it work, make it economical, then tune.
  • No thread waiting. Waiting threads are locked resources. Resource pooling for costly resources is one of the most impactful ways to gain scale and speed.
  • Nothing is straight and simple; or, as they say, the devil is in the details. Even if a service looks simple, there will be a lot of nuances once scale gets to it.
