Processing huge files from S3: part 2
A few weeks ago I published an npm package and write-up that let a user specify how much of a file should be downloaded and held in memory at any given time; I wanted precise control over local memory utilisation when dealing with multi-gigabyte files in S3.
Since that blog post I've delved into the depths of Node.js streams looking for a better way. To cut a long story short, I probably wrote a useless thing: in almost all circumstances, this is all you need to read a large file from S3 in a memory-efficient way:
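Here's a minimal sketch of that approach (the bucket, key and region below are placeholders rather than anything from the real library):

```js
// Minimal sketch: stream an object from S3 and read it line by line.
// Backpressure keeps only a small buffer of the file in memory.
const { S3 } = require("@aws-sdk/client-s3");
const readline = require("readline");

const s3 = new S3({ region: "eu-west-1" });

(async () => {
  const { Body } = await s3.getObject({
    Bucket: "my-huge-files",        // placeholder bucket
    Key: "multi-gigabyte-file.csv", // placeholder key
  });

  const rl = readline.createInterface({ input: Body, crlfDelay: Infinity });

  for await (const line of rl) {
    // Do whatever per-line work you need here.
    console.log(line.length);
  }
})();
```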
So what happened?
One of the goals I'd set myself for this weekend was to replace the simplistic buffering approach in the s3-readline library with a more elegant stream implementation.
One of the primary concerns when writing the library was keeping the amount of memory used in our Node.js application by the incoming stream from S3 small. In the first cut of the lib I used the Range parameter on the getObject method to request specific chunks of the file from S3.
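Roughly speaking (this is an illustrative sketch, not the actual library code, and the chunk size, bucket and key are made up), that first cut looked something like this:

```js
// Illustrative sketch of the original approach: fetch the file in
// fixed-size chunks via the Range header, holding one chunk at a time.
const { S3 } = require("@aws-sdk/client-s3");

const s3 = new S3({ region: "eu-west-1" });
const CHUNK_SIZE = 1024 * 1024; // how much of the file to hold in memory

async function readChunk(bucket, key, start) {
  const { Body } = await s3.getObject({
    Bucket: bucket,
    Key: key,
    // Only request the bytes we're prepared to hold in memory right now.
    Range: `bytes=${start}-${start + CHUNK_SIZE - 1}`,
  });

  // Collect the chunk into a single Buffer before handing it back.
  const parts = [];
  for await (const part of Body) parts.push(part);
  return Buffer.concat(parts);
}
```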
After some reading around backpressure in Node.js streams, it became obvious that setting a high water mark on the readable stream would limit the amount of data the stream stored in its internal buffer. Unfortunately, as the S3 class constructs the stream in its entirety, there was no way for us to override the high water mark.
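For a stream you construct yourself, the high water mark is just a constructor option (a quick sketch below); the SDK, however, hands the stream back fully built, so there's nowhere to pass it.

```js
const { Readable } = require("stream");

// When you build the Readable yourself you can cap its internal buffer:
const myStream = new Readable({
  highWaterMark: 1024, // buffer at most ~1 KiB before applying backpressure
  read() {
    // push data here
  },
});
```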
The solution now seemed simple: I'd ditch the @aws-sdk/client-s3 dependency and write my own S3 readable stream implementation around createConnection and the AWS REST API, since http.request() doesn't let you specify the high water mark of the underlying socket.
Before I set about the above, I thought I'd take my new-found understanding and check what highWaterMark the AWS SDK actually set on the stream:
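Something along these lines (bucket and key are placeholders again):

```js
const { S3 } = require("@aws-sdk/client-s3");
const s3 = new S3({ region: "eu-west-1" });

(async () => {
  const { Body } = await s3.getObject({
    Bucket: "my-huge-files",        // placeholder bucket
    Key: "multi-gigabyte-file.csv", // placeholder key
  });

  // readableHighWaterMark reports the internal buffer limit the SDK chose.
  console.log(Body.readableHighWaterMark);
})();
```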
16 KiB. A 16 kibibyte buffer is smaller than I was expecting; there are very few platforms that couldn't handle holding 16 KiB of buffered data. This is the point where I suspected the lib I wrote may not be that useful, but at least my lib wouldn't run into timeout issues…
As I stated in my original blog post, I was unsure of just how long you could pause (not read from) the Readable stream returned from s3.getObject.
I had assumed (spoiler: mistake) that there would be some finite time in which the underlying call would time out, so, two weeks late, I decided to test this assumption.
I ran (some variation of) the following code:
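It was a throwaway script, but the shape of it was roughly this (bucket, key and the exact logging are placeholders):

```js
// Sketch of the timeout test: read ~32k characters' worth of lines,
// then deliberately stop reading for an increasing amount of time
// before pulling the next ~32k, and see whether the stream ever errors.
const { S3 } = require("@aws-sdk/client-s3");
const readline = require("readline");

const s3 = new S3({ region: "eu-west-1" });
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  const { Body } = await s3.getObject({
    Bucket: "my-huge-files",        // placeholder bucket
    Key: "multi-gigabyte-file.csv", // placeholder key
  });

  const rl = readline.createInterface({ input: Body, crlfDelay: Infinity });

  let chars = 0;
  let pauseSeconds = 0;

  for await (const line of rl) {
    console.log(`read line of length ${line.length}`);
    chars += line.length;

    // Once ~32k chars have been pulled, the ~16k internal buffer must have
    // been drained and refilled, so fresh reads from S3 were triggered.
    if (chars >= 32 * 1024) {
      chars = 0;
      console.log(`pausing for ${pauseSeconds}s`);
      await sleep(pauseSeconds * 1000); // stop reading; does it time out?
      pauseSeconds += 30;
    }
  }
})();
```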
The above code keeps pulling lines from the async iterator until it has read ~32k characters (to ensure that the internal buffer (~16k) is exhausted and more reads have to be triggered), then sits and does nothing for 0, 30, 60, 90 seconds and so on before pulling the next 32k. The idea was that at some point it would throw, and I'd have validated my timeout theory.
Unfortunately (for me), what actually happened was that the stream happily waited for the arbitrary times I'd asked it to without complaint. When I reached 600 seconds, I decided that was probably longer than I'd ever practically need and stopped testing.
So to wrap up: taking the stream from s3.getObject().Body and passing it to readline.createInterface() is performant, won't time out when waiting for long-running async operations between lines, and is efficient on memory usage.
The moral of this story: it's often tempting to solve an issue without testing your underlying assumptions.