How we use S3 Select for Schema Validation and Filtering Data

Nagaraj Tantri
MiQ Tech and Analytics
6 min read · Oct 16, 2018

What if you have to validate or filter data on S3 to get more specific insights?

Currently, these are the most common options:

  • Copy the S3 files to Redshift and then filter the data there.
  • Write a Hive script to load the data and run the query.
  • Use Athena to load the file and run the query.

These options tend to be slow or costly, and they are cumbersome for smaller ad hoc requirements.

What if there were a way to run queries directly on S3 files?

Introducing — S3 Select from Amazon Web Services (AWS)

(Picture courtesy AWS https://aws.amazon.com/blogs/aws/s3-glacier-select/)

S3 Select

S3 Select is a unique feature introduced by AWS that runs SQL-style queries directly on S3 files. AWS states that the query is executed on the S3 platform itself and only the filtered data is returned to us. Since the work happens on S3's side, it is pretty fast for ad hoc queries on those files.

Use Cases:

  • Validating Data: When we run our Hive scripts, we see a lot of errors caused by bad schemas. These errors become harder to track down as the number of files grows. S3 Select helps with quick schema-level validation and file data checks.
  • Filtering Data: Say we want all the customers who used a desktop to purchase on an e-commerce website. If we have many compressed S3 CSV files with a column that records the "Device Type", we can quickly get those customer details by querying for rows where the device type equals "Desktop" (a sketch of such a query follows this list).
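For the filtering use case, the query could look something like the line below (a sketch: it assumes the files have a header row with a column literally named "Device Type", and selecting by header name also requires FileHeaderInfo to be set to USE in the input serialization):

select * from s3object s where s."Device Type" = 'Desktop'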

AWS CLI

Here is the quick CLI version of running the query:

aws s3api select-object-content --bucket "bucket-name" \
--key "path/to/file/mysample-file.tsv.gz" \
--expression "select * from s3object limit 100" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {}}' \
"sample.csv"

The above command selects all the data from the file mysample-file.tsv.gz placed in the folder s3://bucket-name/path/to/file/. The file is tab-delimited (TSV) and GZIP-compressed, which is why the input serialization specifies a tab field delimiter and GZIP compression. The output is CSV data, saved to a file named sample.csv.

The documentation for select-object-content is here: https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html

NOTE: The AWS CLI version should be 1.15.0 or above.
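For comparison, here is a rough boto3 equivalent of the CLI call above (a sketch rather than production code; the bucket and key are the same placeholders):

import boto3

s3 = boto3.client("s3")

# Same request as the CLI example: a GZIP-compressed, tab-delimited file,
# queried with SQL and returned as CSV.
response = s3.select_object_content(
    Bucket="bucket-name",
    Key="path/to/file/mysample-file.tsv.gz",
    ExpressionType="SQL",
    Expression="select * from s3object limit 100",
    InputSerialization={"CSV": {"FieldDelimiter": "\t"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; the filtered rows arrive as raw bytes
# inside "Records" events.
with open("sample.csv", "wb") as out:
    for event in response["Payload"]:
        if "Records" in event:
            out.write(event["Records"]["Payload"])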

SQL Syntax Highlight

  • Columns can be selected either by name (if the file has a header row) or by column index.
  • When selecting by index, a CSV file with 10 columns lets you select the first two columns with: select _1, _2 from s3object;
  • The queried file can be either a delimited file or a JSON file, and JSON can be queried using name, index, or object notation (the matching input serialization settings are sketched below). For example, we can run select s.projects[0].project_name from S3Object s against a JSON object such as:
{"projects":
[
{"project_name":"project1", "completed":false},
{"project_name":"project2", "completed":true}
]
}

More on syntax can be viewed here: https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html
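The serialization settings decide which of these syntaxes is available. A quick sketch of the corresponding boto3 InputSerialization values (the same JSON works for the CLI's --input-serialization flag):

# Select columns by header name: declare that the first row is a header.
csv_with_header = {"CSV": {"FileHeaderInfo": "USE"}}

# Select columns by index (_1, _2, ...): no header handling needed.
csv_by_index = {"CSV": {"FileHeaderInfo": "NONE"}}

# Query a single JSON document with object notation (s.projects[0].project_name).
json_document = {"JSON": {"Type": "DOCUMENT"}}

# For newline-delimited JSON (one object per line), use LINES instead.
json_lines = {"JSON": {"Type": "LINES"}}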

AWS Python Based

At MiQ, we have created a script that takes care of executing the S3 Select queries on files and printing the output to a file or to the console.

Here are the prerequisites before you run the script: a working Python environment with the boto3 library installed, and AWS credentials configured for the account that owns the bucket.

Once you have those in place, create the following files:

  • select_runner.py
  • s3_executor.py

You can check out the source code here on our MiQ Github Page:

With the above files in place, running python s3_executor.py --help will list the options for the script.

Here is what each option means:

  • -comp --compression: Whether the file is compressed (true/false).
  • -c --ctype: Content type of the file (CSV/JSON).
  • -d --delimiter: The delimiter (COMMA/TAB). We have not implemented other delimiters yet.
  • -s --sql: The SQL to execute, for example: select _1, _2, _20 from s3object
  • -o --outputfile: The filename to dump the records to after running the query.
  • -e --throwerror: Boolean (true/false) that determines whether to raise an error during processing if a file is corrupt or smaller than 500 bytes (which in our case means the file has no data, only headers).

Execution:

Here are a few examples of running the script:

  • Select rows from all files under a folder where the 20th column is "VIEW_DETECTED", return the 1st, 2nd, and 15th columns, and save the output to sample.csv. The query: select _1, _2, _15 from s3object s where _20 = 'VIEW_DETECTED'
python s3_executor.py -b "bucket-name" -p "path/to/files/" -d TAB -s "select _1, _2, _15  from s3object s where _20 = 'VIEW_DETECTED' " -o sample.csv
  • Filter on the 5th column (the HTTP referrer URL) and return the 2nd and 3rd columns for rows where the referrer matches '%it.eurosport.com%'. The query: select _2, _3 from s3object s where _5 like '%it.eurosport.com%'
python s3_executor.py -b "bucket-name" -p "path/to/files/" -d TAB -s "select _2, _3 from s3object s where _5 like '%it.eurosport.com%' " -o eurosport.csv
  • Select all columns with the same filter:
python s3_executor.py -b "bucket-name" -p "path/to/files/" -d TAB -s "select * from s3object s where _5 like '%it.eurosport.com%' " -o eurosport_all_columns.csv

Since S3 Select runs directly on S3 and does not require creating any external table, it helps us run ad hoc queries faster and download the results from multiple files. We plan to use S3 multipart upload to push the filtered results back to S3 so we can do more detailed aggregated processing.
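One way to do that upload (a sketch assuming the filtered results sit in a local file; boto3's managed transfer switches to multipart upload automatically for large files):

import boto3

s3 = boto3.client("s3")

# upload_file uses multipart upload under the hood for large files,
# so the filtered results can be pushed back to S3 for further processing.
s3.upload_file(
    Filename="sample.csv",              # local filtered output
    Bucket="bucket-name",               # placeholder
    Key="filtered-results/sample.csv",  # placeholder key
)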

This is how we use the above setup:

We have introduced a script that recursively goes through all the files under a folder and executes the relevant SQL-style queries (a simplified sketch of this loop appears after the list below). This covers two major use cases:

  • Validation: A scheduled daily job queries the files we receive and validates the schema returned from each file. We also check the file size and compression to make sure we do not have corrupt data.
  • Filtering: We filter the data from those files for quick ad hoc analysis, storing the results in local files. These can later be used for quick insights or client-side reporting.
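Here is what that loop can look like in rough form (this is not the actual s3_executor.py; the bucket, prefix, and query values are placeholders):

import boto3

s3 = boto3.client("s3")
bucket = "bucket-name"     # placeholder
prefix = "path/to/files/"  # placeholder
query = "select _1, _2, _15 from s3object s where _20 = 'VIEW_DETECTED'"

# Walk every object under the prefix, run the same S3 Select query on each,
# and append the filtered rows to one local file.
paginator = s3.get_paginator("list_objects_v2")
with open("sample.csv", "wb") as out:
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            response = s3.select_object_content(
                Bucket=bucket,
                Key=obj["Key"],
                ExpressionType="SQL",
                Expression=query,
                InputSerialization={"CSV": {"FieldDelimiter": "\t"}, "CompressionType": "GZIP"},
                OutputSerialization={"CSV": {}},
            )
            for event in response["Payload"]:
                if "Records" in event:
                    out.write(event["Records"]["Payload"])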

Limitations

  • Currently, we cannot run aggregation queries at the folder level, because each query runs against a single S3 object (the above script simply loops over the files). This is a limitation of S3 Select itself.
  • We cannot join across different files or schemas.
  • Only a limited subset of SQL query elements is supported.

Update

07th December 2018

  • We have added support for Snappy-compressed Parquet files (input serialization sketched below).
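Reading Parquet only requires switching the input serialization; a sketch of the boto3 values (Snappy compression inside Parquet is handled transparently, and columns can be referenced by their Parquet schema names):

# Parquet input: no delimiter or compression settings needed.
parquet_input = {"Parquet": {}}

# Results can still come back as CSV (or JSON).
parquet_output = {"CSV": {}}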
