Random-Access (Seekable) Streams for Amazon S3 in C#

Lee Harding · circuitpeople · Nov 10, 2020 · 6 min read
Some stream interfaces to S3 objects are better than others.

Some files are big enough that working on them in memory isn’t desirable, or even possible. And even when memory isn’t the issue, transferring the entire file from S3 can be expensive and wasteful.

In this article I demonstrate using the S3 APIs to implement optimized reads via a general-purpose, “seekable” Stream implementation in C#. This stream can be used with almost any existing code library or package, and I give examples using DiscUtils, MetadataExtractor, Parquet.Net and the .NET Core compression library.

The solution is shown to provide orders-of-magnitude reductions in network traffic when performing simple operations on ISO, image, Zip and Parquet files, and it performs similarly with other complex binary formats up to the 5TB S3 object size limit without noticeable degradation in performance (so long as only a small amount of the content is actually used).

Trial and Error

Let’s set the stage: Suppose I have disc image files in ISO format on Amazon S3, and I need to extract information from just one or two files inside each image. I will only do this on-demand (e.g. when someone clicks a button) and can’t afford to spend much money on it.

I’m familiar with the excellent DiscUtils library, and know I want to use it since the ISO format is complicated and I don’t want to reinvent the wheel just to read it:

dotnet new console
dotnet add package AWSSDK.S3
dotnet add package DiscUtils

The DiscUtils website example code for opening a disk image and reading a file looks like it’s very straightforward, and it seems like it will be simple to grab the stream from S3 and use that to hand to the CDReader:

using (FileStream isoStream = File.Open(@"C:\temp\sample.iso", FileMode.Open))
{
    CDReader cd = new CDReader(isoStream, true);
    Stream fileStream = cd.OpenFile(@"Folder\Hello.txt", FileMode.Open);
    // Use fileStream...
}

That being the case, the code for reading the ISO from S3 would be something simple like this:

var s3 = new AmazonS3Client();
using var response = await s3.GetObjectAsync(BUCKET, KEY);
using var iso = new CDReader(response.ResponseStream, true);
using var file = iso.OpenFile(FILENAME, FileMode.Open, FileAccess.Read);
using var reader = new StreamReader(file);
var content = await reader.ReadToEndAsync();
await Console.Out.WriteLineAsync(content);

Most libraries that work with files by taking filesystem paths will also be able to work with Stream objects (not all, but most). DiscUtils does, as the CDReader class may be constructed with either. The ZipArchive class in System.IO.Compression is another example.

So the code above should work, right? Nope.

Unhandled exception. System.NotSupportedException: Specified method is not supported.
at System.Net.Http.HttpBaseStream.set_Position(Int64 value)
at DiscUtils.Iso9660.VfsCDReader..ctor(Stream data, Iso9660Variant[] variantPriorities, Boolean hideVersions)
at DiscUtils.Iso9660.VfsCDReader..ctor(Stream data, Boolean joliet, Boolean hideVersions)
at DiscUtils.Iso9660.CDReader..ctor(Stream data, Boolean joliet)
at Seekable_S3_Stream.Program.Main(String[] args) in C:\Users\lee.PC\Seekable S3 Stream\Program.cs:line 17
at Seekable_S3_Stream.Program.<Main>(String[] args)

Our first attempt throws a NotSupportedException when run, specifically in the HTTP stream class. It makes sense that a library reading ISO files would seek to positions within the file: usually it can (streams read from disk support seeking), and the ISO format encourages it by making extensive use of directories and file position offsets.

Like ISO files, Zip files are structured with a “central directory” at the end that points to file content elsewhere in the archive using relative offsets. To find a single file, the directory needs to be read first, then the section of the file containing the desired content. Reading any other part of the file is unnecessary.

It also makes sense that an HTTP stream isn’t “seekable”, but the reason may not be obvious. The HTTP protocol doesn’t include a way to stop, rewind and replay a download. Well, that’s not quite right: HTTP *does* define a way to do something like that, but it’s optional and not every web server supports it. Lucky for us, S3 is one of the HTTP services that does support this kind of “seeking” via Range headers (which I’ve written about before). It’s not quite the same as a seekable stream, though, so we’re going to need to get clever to make this work.
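
To see the Range mechanism in isolation, here’s what a single ranged GET looks like with the AWS SDK for .NET. The byte offsets here are arbitrary, and this isn’t the final solution, just an illustration:

// Ask S3 for only the first 64KB of the object. The SDK translates ByteRange
// into an HTTP "Range: bytes=0-65535" header, and S3 returns just that slice.
var ranged = await s3.GetObjectAsync(new Amazon.S3.Model.GetObjectRequest
{
    BucketName = BUCKET,
    Key = KEY,
    ByteRange = new Amazon.S3.Model.ByteRange(0, 64 * 1024 - 1)
});
using var slice = ranged.ResponseStream;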

On StackOverflow you’ll find plenty of advice to work around this issue by copying the S3 response stream into a MemoryStream. Let’s try that:


using var response = await s3.GetObjectAsync(BUCKET, KEY);
var ms = new MemoryStream();
await response.ResponseStream.CopyToAsync(ms);
ms.Seek(0, SeekOrigin.Begin);
using var iso = new CDReader(ms, true);

Now, that’ll do it for sure, right? Well, yes, sort of.

It works, albeit slowly. Very slowly, because the ISO files I’m working with are quite large. The network bandwidth available to a single Lambda execution is in the range of 1–2Gbps, so downloading the whole object makes the response unworkably slow (for instance, a 4.7GB DVD-sized image at 2Gbps takes roughly 19 seconds just to transfer, before any parsing happens). Moreover, fan-out techniques won’t improve things here. So, what to do? Punt and use a high-memory instance or container? Nope. Fix the root problem.

We need a Stream implementation that supports seeking, so the reader can skip past sections of the file or rewind to access previously read sections. That’s how the CDReader class expects to work with a stream: it reads only a tiny fraction of the file’s content to get at a single file. We can confirm this by creating a simple stream wrapper that records the number of bytes actually read (versus skipped by seeking from place to place):

class ReadCountingStream : MemoryStream
{
    public long Total = 0L;

    public override int Read(byte[] buffer, int offset, int count)
    {
        var read = base.Read(buffer, offset, count);
        Total += read;
        return read;
    }
}

// Same MemoryStream approach as before, but now counting the bytes actually read.
var ms = new ReadCountingStream();
await response.ResponseStream.CopyToAsync(ms);
ms.Seek(0, SeekOrigin.Begin);
using var iso = new CDReader(ms, true);
// ... open and read FILENAME as before ...

await Console.Out.WriteLineAsync($"{ms.Total / (float)ms.Length * 100}% read");

Running the above code gives me the percentage of the file that is actually used, and establishes the potential for optimizing the performance of the code (the time spent reading the content is the vast majority of the total time):

0.003969788% read

Well, that’s good news: less than 1/100th of a percent of the file is actually read. If we implement a “perfect” stream that reads only those bytes from S3 we can expect an enormous performance improvement (probably bounded by latency rather than bandwidth). Cool, let’s do it.

This time, instead of wrapping a concrete stream, we’re going to create a ground-up implementation that calls S3 for blocks of the file as they are needed, based on the current Position and the number of bytes requested. The source for my SeekableS3Stream class can be found on GitHub (link below), but here is how it’s used:


var stream = new Cppl.Utilities.AWS.SeekableS3Stream(s3, BUCKET, KEY, 1 * 1024 * 1024, 4);
using var iso = new CDReader(stream, true);
using var file = iso.OpenFile(FILENAME, FileMode.Open, FileAccess.Read);
using var reader = new StreamReader(file);
var content = await reader.ReadToEndAsync();

It’s even simpler than using S3 directly (here the constructor arguments set a 1MB range size and keep up to four ranges in memory). To see how it performs, we’ll again print the read statistics for comparison:

await Console.Out.WriteLineAsync($"{stream.TotalRead / (float)stream.Length * 100}% read, {stream.TotalLoaded / (float)stream.Length * 100}% loaded");

Which prints:

0.003969788% read, 0.3565448% loaded

So, with the new stream in use, the amount of data transferred from S3 drops by more than 99.6%, a better-than-100x improvement. And that’s before optimizing the size of the ranges being pulled or the number held in the MRU list, so further improvements are probably possible.

This stream implementation isn’t perfect, but it’s far, far better than the MemoryStream approach. So much so that it’s a no-brainer to drop that bad idea like a hot rock.

Optimization

Somewhere between a “pure” streaming solution, where only one buffer at a time is available, and the “read it all” approach of a MemoryStream lies the “right-sized” solution: one that pulls only the needed bytes from S3 and uses just enough memory to avoid repeated calls to S3 when the same file ranges are used multiple times.

Ideally, the right amount of memory is determined by the size of the Lambda instance (memory ranging from 256MB to 3GB) or container the code runs on, as well as by the workload (how much of the file is ultimately used). My SeekableS3Stream class lets me tweak the size of the ranges and how many of them to hold in the most-recently-used (MRU) list. Those two parameters allow me to tune the performance for any given workload.
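
To make those mechanics concrete, here is a heavily simplified sketch of how a stream like this can map reads onto ranged GETs and keep a small MRU cache of ranges. It is illustrative only, not the actual SeekableS3Stream source: the class name, field layout and the blocking calls inside Read are my own simplifications, and a production version would implement the async read path as well.

using System;
using System.Collections.Generic;
using System.IO;
using Amazon.S3;
using Amazon.S3.Model;

// Illustrative only: a read-only, seekable stream that fetches fixed-size
// "pages" from S3 on demand and keeps the most recently used pages in memory.
class RangedS3Stream : Stream
{
    readonly IAmazonS3 _s3;
    readonly string _bucket, _key;
    readonly long _pageSize;
    readonly int _maxPages;
    readonly Dictionary<long, byte[]> _pages = new Dictionary<long, byte[]>();
    readonly LinkedList<long> _mru = new LinkedList<long>();

    public RangedS3Stream(IAmazonS3 s3, string bucket, string key, long pageSize, int maxPages)
    {
        _s3 = s3; _bucket = bucket; _key = key; _pageSize = pageSize; _maxPages = maxPages;
        // One metadata (HEAD) request up front to learn the object's size.
        Length = s3.GetObjectMetadataAsync(bucket, key).GetAwaiter().GetResult().ContentLength;
    }

    public override long Length { get; }
    public override long Position { get; set; }
    public override bool CanRead => true;
    public override bool CanSeek => true;
    public override bool CanWrite => false;

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (Position >= Length) return 0;
        var page = GetPage(Position / _pageSize);
        var pageOffset = (int)(Position % _pageSize);
        var n = Math.Min(count, page.Length - pageOffset);
        Array.Copy(page, pageOffset, buffer, offset, n);
        Position += n;
        return n; // may be fewer bytes than requested; callers are expected to loop
    }

    byte[] GetPage(long index)
    {
        if (_pages.TryGetValue(index, out var cached))
        {
            _mru.Remove(index); _mru.AddFirst(index); // promote on cache hit
            return cached;
        }
        var start = index * _pageSize;
        var end = Math.Min(start + _pageSize, Length) - 1;
        // Ranged GET: only this page's bytes cross the network.
        using var response = _s3.GetObjectAsync(new GetObjectRequest
        {
            BucketName = _bucket, Key = _key, ByteRange = new ByteRange(start, end)
        }).GetAwaiter().GetResult();
        using var copy = new MemoryStream();
        response.ResponseStream.CopyTo(copy);
        var page = copy.ToArray();

        _pages[index] = page;
        _mru.AddFirst(index);
        // Evict the least recently used page when the cache is full.
        if (_mru.Count > _maxPages) { _pages.Remove(_mru.Last.Value); _mru.RemoveLast(); }
        return page;
    }

    public override long Seek(long offset, SeekOrigin origin) => Position = origin switch
    {
        SeekOrigin.Begin => offset,
        SeekOrigin.Current => Position + offset,
        _ => Length + offset
    };

    public override void Flush() { }
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

The real class on GitHub additionally exposes the TotalRead and TotalLoaded counters used earlier, among other differences.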

The size of the ranges is important because it controls the “excess” bytes. If the ranges are too large for the workload, much of each range goes unused: if the range size is 1MB and only the first KB of it is used, nearly all of the transferred bytes are wasted. On the flip side, if the range size is too small, it can result in a large number of calls to S3.

The number of ranges to hold in the MRU list becomes important when the workload results in repeated reads of some ranges. For example, Zip files keep their central directory at the end of the file, and it is consulted again and again as individual entries are opened; keeping that range cached avoids re-fetching it from S3.
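
As a usage sketch (reusing the constructor parameters shown earlier; the ZIP_KEY and entry name here are placeholders), the same stream drops straight into ZipArchive:

using System.IO.Compression;

var zipStream = new Cppl.Utilities.AWS.SeekableS3Stream(s3, BUCKET, ZIP_KEY, 1 * 1024 * 1024, 4);
using var zip = new ZipArchive(zipStream, ZipArchiveMode.Read);
// Listing entries touches only the central directory at the end of the file;
// opening one entry then pulls just the ranges containing that entry's data.
using var entry = zip.GetEntry("folder/hello.txt").Open();
using var reader = new StreamReader(entry);
await Console.Out.WriteLineAsync(await reader.ReadToEndAsync());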

Summary

If you’re working with very small files, an approach using a byte[] or a MemoryStream provides effective read/write, random-access memory. But for larger files it’s sub-optimal from a cost perspective and won’t scale, so consider the seekable stream approach instead.

You can find my code for SeekableS3Stream, as well as examples for Parquet and other formats, on GitHub.
