S3 Select — a new revolution “at rest”

Michał Adamkiewicz
4 min read · Apr 16, 2018

If you followed the last re:Invent from Amazon Web Services, you probably heard about one of the new services, called “S3 Select”. Not much time has passed since that event, and on 4th April it reached general availability. OK… but what exactly is it about?

S3 Select is, in a way, a new kind of technology for querying flat files. It joins the stable where you can already find Spectrum and Athena (for those who are not familiar: both are dedicated serverless platforms for SQL-compliant queries, and both let you retrieve data directly from flat files stored on S3). I wrote “in a way” because “Select” differs a lot from the services just mentioned. It changes the way you consume your data, and what’s even more important, it does so “at rest”.

That said, let’s recap the typical pattern we (or at least I) follow when consuming CSV/JSON files from S3. There are usually not many possibilities: download the file to the local filesystem, or load it directly into memory. Starting now, AWS engineers let us choose another approach: select your data at the storage level, before it is sent over the network link. Great idea!

How does it work? Our examples are written in Python, so in case of any error make sure the boto3 library is up to date:

pip install boto3 --upgrade

The new function provided by the Python SDK is “select_object_content”. Below we can find an example with the appropriate arguments:

import boto3

s3 = boto3.client('s3')

def s3_select():
    r = s3.select_object_content(
        Bucket='bucketX',
        Key='json_file_Y.gz',
        ExpressionType='SQL',
        Expression="""select * from s3object s""",
        InputSerialization={'JSON': {'Type': 'Document'}},
        OutputSerialization={'CSV': {}},
    )
    return r

As you can see, there are six arguments to pass in this call. Bucket and Key are obvious; ExpressionType currently supports only one value: SQL. The Expression itself is ANSI SQL, where only the SELECT, FROM and WHERE clauses are available. The last two arguments describe the format of the file you are trying to read and the way the results should be presented to you. You have probably already noticed that “S3 Select” also allows you to query compressed files directly (great idea again!).
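Just to illustrate (the bucket, key and column names below are made up), the same call could be pointed at an uncompressed CSV file with a WHERE clause, so that only the matching rows ever leave S3:

import boto3

s3 = boto3.client('s3')

# Hypothetical example: filter a CSV file server-side.
# Bucket, key and column names are invented for illustration only.
r = s3.select_object_content(
    Bucket='bucketX',
    Key='users.csv',
    ExpressionType='SQL',
    # FileHeaderInfo='USE' lets us reference CSV columns by their header names.
    Expression="select s.name, s.country from s3object s where s.country = 'PL'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
    OutputSerialization={'CSV': {}},
)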

To read the results easily you can use:

for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
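One thing worth knowing is that the payload arrives as an event stream, so for bigger results it is safer to collect the chunks first and decode them at the end, instead of printing them one by one. A minimal sketch (variable names are mine):

# Collect the raw chunks first; a single record can in principle
# be split across two 'Records' events.
chunks = []
for event in r['Payload']:
    if 'Records' in event:
        chunks.append(event['Records']['Payload'])
    elif 'End' in event:
        # 'End' means the whole result has been delivered.
        break

result = b''.join(chunks).decode('utf-8')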

Now what about costs and limits?

You pay $0.0007 per GB of data returned by your select statement and $0.002 per GB of data scanned on S3 storage. Hence a natural cost optimization is to always (when possible) store your data compressed.
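A quick back-of-the-envelope calculation (the sizes are made up, the prices are the ones quoted above) shows how the bill is composed:

# Hypothetical query: 10 GB scanned on S3, 1 GB returned to the client.
scanned_gb = 10
returned_gb = 1

cost = scanned_gb * 0.002 + returned_gb * 0.0007
print('Query cost: ${0:.4f}'.format(cost))  # Query cost: $0.0207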

The current implementation lets us play only with the JSON and CSV file formats. The only compression standard aligned with the Select service is GZIP. Regarding encoding, there is only one option for now, which of course is UTF-8. Last but not least, there is no encryption support implemented yet. As for the features available inside our expression, we can rely on operators (conditional, math, logical or string, e.g. LIKE) and functions (string, cast, math, aggregate). Taking into account how short the life of this service has been so far, its possibilities already look very promising, and AWS will surely put a lot of effort into adding new pieces soon.
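To give a feeling of what such expressions can look like, here are a few examples I would expect to work based on the operators and functions listed above (the column names are hypothetical):

# Hypothetical expressions illustrating the available operators/functions.
expressions = [
    # string operator (LIKE)
    "select s.name from s3object s where s.name like 'Mich%'",
    # CAST plus a comparison operator
    "select s.id from s3object s where cast(s.price as float) > 100",
    # aggregate function
    "select count(*) from s3object s",
]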

S3 Select is the first service that allows you to skip any part of a file (if you don’t need it) and directly “stream” only the selected fields/columns. The benefits of this are indisputable! Let’s check that in practice and compare it with the “old-fashioned” get_object method available in boto3. For the comparison we will use a single JSON file compressed with GZIP: 13 MB (70 MB uncompressed). It contains 63,000 documents, each built from roughly 45 nodes. Our goal is to extract only two nodes from every document.

First, let’s build a small decorator to help us measure the time:

import time

def timing(f):
    def wrap(*args):
        time1 = time.time()
        ret = f(*args)
        time2 = time.time()
        print('{0} function took {1} s'.format(f, time2 - time1))
        return ret
    return wrap
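As a side note, wrapping the inner function with functools.wraps (or simply printing f.__name__) would make the output a bit more readable; a possible variant:

import time
from functools import wraps

def timing(f):
    @wraps(f)  # keep the original function's name on the wrapper
    def wrap(*args):
        time1 = time.time()
        ret = f(*args)
        time2 = time.time()
        # f.__name__ prints e.g. "normal_download" instead of the full repr
        print('{0} took {1:.2f} s'.format(f.__name__, time2 - time1))
        return ret
    return wrap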

Now, here is the body of the function responsible for downloading the file and mapping the JSON to retrieve only the proper “fields”:

import io
import json
from gzip import GzipFile

def map_json(elem):
    var = elem.get('meta')
    if var is None:
        dct = {"session_long": None}
    else:
        dct = {"session_long": var.get('session_long')}
    if var is None:
        dct["session"] = None
    else:
        dct["session"] = var.get('session')
    return dct

@timing
def normal_download():
    bucket = 'aws_benchmark_pl'
    key = 'test-dataset.gz'
    file = s3.get_object(Bucket=bucket, Key=key)
    byte_file = io.BytesIO(file['Body'].read())
    unpacked_file = GzipFile(None, 'rb', fileobj=byte_file).read().decode('utf-8')

    data_split = unpacked_file.split("\n")
    data = [json.loads(values) for values in data_split if values]
    json_mapped = list(map(map_json, data))

The second function uses the new possibilities:

@timing
def new_s3_select():
    r = s3.select_object_content(
        Bucket='aws_benchmark_pl',
        Key='test-dataset.gz',
        ExpressionType='SQL',
        Expression="""select s."meta"."session", s."meta"."session_long" from s3object s""",
        InputSerialization={'CompressionType': 'GZIP', 'JSON': {'Type': 'Document'}},
        OutputSerialization={'JSON': {'RecordDelimiter': '\n'}},
    )
    for event in r['Payload']:
        if 'Records' in event:
            records = event['Records']['Payload'].decode('utf-8')
            # TBD...
        elif 'Stats' in event:
            statsDetails = event['Stats']['Details']
            # TBD...
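The comparison then boils down to calling both decorated functions; the original benchmark script is not shown here, so this is just a sketch of how it can be run:

# Run both variants once; @timing prints the elapsed time for each.
if __name__ == '__main__':
    new_s3_select()
    normal_download()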

Below you can see only one set of timings, but repeated runs consistently give a result about 2.5 times faster for the new service (comparing, in this specific example, with the other function implemented with boto3).

<function test_s3_select at 0x05D08228> function took 7.0760657787323 s
<function test_normal at 0x05D08348> function took 25.83466076850891 s

In conclusion, “S3 Select” is definitely something desirable if you use S3 storage often. For more details you can follow the documentation:

https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectSELECTContent.html
