Processing huge files from S3 asynchronously, one line at a time, in Node.
Update: All of this is unnecessary. For a much simpler way to achieve this, and the reasons it's unnecessary, read part 2.
Recently we had an issue: we needed to process every line of a CSV file stored in AWS S3. The processing could take a variable amount of time and required several asynchronous calls to downstream systems.
The naive solution would have been to just download the entire file and process it from memory. That wasn't an option for us, as the files we're going to be processing could easily exceed the total memory available to our server process.
Next we considered streaming the file from S3 using a read stream built from a call to s3.getObject, but it wasn't clear how long we could keep a stream from S3 paused while executing long-running async processes, or just how much data the stream would buffer in memory at any given time.
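For reference, the streaming approach we were weighing up would have looked roughly like the sketch below; the bucket, key and doSomethingSlow stub are placeholders, and this assumes the v2 aws-sdk client.

```js
const AWS = require('aws-sdk');
const readline = require('readline');

const s3 = new AWS.S3();

// Stand-in for the slow calls to downstream systems.
const doSomethingSlow = async (line) => { /* ... */ };

async function processWithStream() {
  // Build a read stream straight from the GetObject request...
  const stream = s3
    .getObject({ Bucket: 'my-bucket', Key: 'huge-file.csv' })
    .createReadStream();

  // ...and split it into lines. The open question for us was how this behaves
  // when every iteration of the loop awaits slow downstream work.
  const rl = readline.createInterface({ input: stream, crlfDelay: Infinity });

  for await (const line of rl) {
    await doSomethingSlow(line);
  }
}
```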
At this point we were left with either implementing some underlying stream logic to honour an async callback handler that could take an indeterminate amount of time, or finding a different solution.
It turns out you can specify a Range parameter when downloading a file from S3, which allows you to request a specific start and end byte. This felt like a solution.
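A ranged GetObject looks something like this (bucket and key are placeholders):

```js
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Fetch only the first MiB of the object (bytes 0 through 1048575, inclusive).
s3.getObject({ Bucket: 'my-bucket', Key: 'huge-file.csv', Range: 'bytes=0-1048575' })
  .promise()
  .then(({ Body }) => console.log(`fetched ${Body.length} bytes`));
```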
We can now request from S3 a number of bytes that is acceptable to our limited local memory, find the lines it contains and process them one by one, and when we run out of lines, request some more of the file.
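In rough terms, and leaving out the background prefetching mentioned further down, the idea looks something like the sketch below (this is an illustration of the technique, not the package's actual internals):

```js
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Yield one line at a time, fetching the object in fixed-size ranged chunks.
// Note: converting each chunk to a string like this assumes the chunk
// boundaries never land mid-way through a multi-byte character, which is
// fine for ASCII CSVs but worth keeping in mind.
async function* linesFromS3(bucket, key, chunkSize = 1024 * 1024) {
  let start = 0;
  let remainder = '';

  while (true) {
    let data;
    try {
      data = await s3
        .getObject({
          Bucket: bucket,
          Key: key,
          Range: `bytes=${start}-${start + chunkSize - 1}`,
        })
        .promise();
    } catch (err) {
      // S3 answers a range that starts past the end of the object with a 416.
      if (err.statusCode === 416) break;
      throw err;
    }

    remainder += data.Body.toString('utf8');
    const lines = remainder.split('\n');
    remainder = lines.pop(); // keep the trailing partial line for the next chunk

    for (const line of lines) {
      yield line;
    }

    // ContentRange looks like "bytes 0-1048575/5242880"; the number after the
    // slash is the total object size, so we know when we've read everything.
    const total = Number(data.ContentRange.split('/')[1]);
    start += chunkSize;
    if (start >= total) break;
  }

  if (remainder.length > 0) yield remainder; // last line with no trailing newline
}
```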
I’ve wrapped this up in an npm package, S3 Readline, and added support for custom delimiters and custom chunk sizes. The example below guarantees that only one MiB of the file will be loaded into memory at any given point (plus whatever data was left over from the previous chunk after all of its lines were processed).
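The sketch below shows roughly how that example reads; the option names and the processLine stub are my guesses rather than the package's documented API, so check the S3 Readline README for the real signatures.

```js
const { S3LineReader } = require('s3-readline'); // assumed import name

// Stand-in for the per-line work against downstream systems.
const processLine = async (line) => { /* ... */ };

// Guessed options: a 1 MiB chunk per ranged request, split on newlines.
const reader = new S3LineReader({
  bucket: 'my-bucket',
  key: 'huge-file.csv',
  chunkSize: 1024 * 1024,
  delimiter: '\n',
});

(async () => {
  for await (const line of reader.getLines()) {
    // The next chunk is fetched in the background while we sit on this await.
    await processLine(line);
  }
})();
```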
getLines returns an AsyncGenerator (a neat feature of JavaScript introduced to V8 et al. in 2017) and allows the developer to pull one line at a time from our S3LineReader class. New data will be fetched asynchronously in the background while the last line of the previous chunk is being processed (which, if noticeable, will manifest on the await inside the for await loop of the example above).
And that's it. I may at a later date look into what the exact restrictions around streams from S3 and timeouts are, and implement a version of this lib that uses native streaming, but for now this is a perfectly workable solution.