Handling medium size files with Lambda
We have many big data tools, but we still need tools for not-so-big data
Below is a workflow that uses Amazon’s serverless offerings to load medium-size files into a database, whether SQL or otherwise.
MapReduce is a programming paradigm in which big data is processed by splitting (mapping) a large input into pieces that several reducers then process in parallel.
We can use Lambda to map, SQS as a coordinator, and Lambda again as a reducer. With this serverless approach the infrastructure is provisioned for us, and we can control parallel execution without the headache of maintenance.
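The division of labour described above can be sketched roughly as follows. This is a minimal illustration, not a full implementation: the names `mapper_handler`, `reducer_handler`, `send_message`, `load_chunk` and the `chunk_key` message field are all hypothetical, and real Lambda handlers take `(event, context)` rather than the simplified arguments used here. The only AWS-specific detail assumed is that Lambda delivers SQS messages as a list under `event["Records"]`, each with a `body` string.

```python
import json
from typing import Callable, Dict, List


def mapper_handler(chunk_keys: List[str], send_message: Callable[[str], None]) -> int:
    """Map step: publish one message per chunk file (hypothetical signature).

    `send_message` is an assumption standing in for whatever enqueues a
    message, for example a thin wrapper around an SQS client.
    """
    for key in chunk_keys:
        send_message(json.dumps({"chunk_key": key}))
    return len(chunk_keys)


def reducer_handler(event: Dict, load_chunk: Callable[[str], int]) -> int:
    """Reduce step: process the batch of queue messages delivered in `event`.

    `load_chunk` is a hypothetical callable that loads one chunk into the
    database and returns the number of rows written.
    """
    total_rows = 0
    for record in event["Records"]:
        body = json.loads(record["body"])
        total_rows += load_chunk(body["chunk_key"])
    return total_rows
```

Passing `send_message` and `load_chunk` in as arguments keeps the sketch runnable without AWS credentials; an in-memory list can stand in for the queue while experimenting locally.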
Why not just use a single Lambda?
The issue is that an input file might be arbitrarily large, and no matter how much memory we provision for our Lambda function, we risk eventually running out of it.
The map-reduce pattern breaks the arbitrarily large file (e.g. millions of rows) into an arbitrary number of files of a fixed size (e.g. n files of 100k rows). Now that we have fixed the size of the file to process, we can safely provision the right amount of memory for our Lambda function.
How come breaking a large file into many small ones does not cause memory problems?
The issue here is read-write latency. We can read a large CSV file in constant memory if we use a file stream and tightly control how much of the file we hold in memory at once.
We can also write that same stream back out to other CSV files in constant memory, because writing to a file stream is about as fast as reading from one. This means that reading a file of 10 million rows and writing it out as files of 100,000 rows each only needs enough memory to hold 100,000 rows, plus the overhead of the file read and write operations.
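To make the constant-memory claim concrete, here is a rough sketch of the split step using only the Python standard library. The function name `split_csv`, the `part-00000.csv` naming scheme, and the choice to repeat the header in every chunk are assumptions made for illustration; in a Lambda the chunks would more likely be written to temporary storage and uploaded, which is left out here.

```python
import csv
import os
from itertools import islice
from typing import List


def split_csv(source_path: str, out_dir: str, rows_per_chunk: int) -> List[str]:
    """Split a CSV into files of at most `rows_per_chunk` data rows each.

    Only one chunk of rows is held in memory at a time, so peak memory
    depends on `rows_per_chunk`, not on the size of the source file.
    """
    paths: List[str] = []
    with open(source_path, newline="") as source:
        reader = csv.reader(source)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return paths
        while True:
            # Pull the next chunk lazily from the stream.
            rows = list(islice(reader, rows_per_chunk))
            if not rows:
                break
            path = os.path.join(out_dir, f"part-{len(paths):05d}.csv")
            with open(path, "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(header)  # repeat the header in every chunk
                writer.writerows(rows)
            paths.append(path)
    return paths
```

For example, `split_csv("input.csv", "/tmp/chunks", 100_000)` on a 10-million-row file would produce 100 chunk files, assuming the output directory already exists.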
However, writing to a database over the network changes the read-write dynamics. A database, in the simplest terms, is a file that takes longer to write to than a plain file takes to append to. Rows are read faster than they can be written, so constant memory becomes impossible during linear code execution: unwritten rows pile up and memory consumption grows over time. We would instead have to implement thread-blocking code, which doesn’t sit well with event-loop models like Node.js. Thread blocking also adds a different problem: eventually hitting the 15-minute Lambda timeout.
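Once each chunk has a fixed size, the reducer can afford to hold the whole chunk in memory and write it in a single batch. The sketch below illustrates that idea with `sqlite3` standing in for the real database, purely so the example runs anywhere; the `items` table, its two columns, and the `load_chunk` name are hypothetical.

```python
import csv
import io
import sqlite3


def load_chunk(conn: sqlite3.Connection, chunk_csv: str) -> int:
    """Insert every data row of one fixed-size CSV chunk in one transaction.

    Memory use is bounded by the chunk size, which is fixed upstream by the
    split step, so it does not grow with the size of the original file.
    """
    reader = csv.reader(io.StringIO(chunk_csv))
    next(reader, None)  # skip the header row
    rows = [(int(row[0]), row[1]) for row in reader]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO items (id, value) VALUES (?, ?)", rows)
    return len(rows)
```

Wrapping the batch in one transaction means a failed chunk leaves no partial rows behind, so the message can simply be retried.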
Naturally, if you are handling real big data (in 2019 terms, think over 100 GB), Lambda won’t be the right option, but there are still quite a few use cases involving millions of rows (files of roughly 50 to 100 MB) that need a low-overhead way of processing.
Essentially, we are playing with space and time. To avoid running out of space, we distribute processing over time; to minimize costs, we delegate the waiting to SQS, where time does not matter.