Can You Run an SQL Query on a Piece of Paper Using AWS?
Have you ever drawn some wonderful tabular data on a piece of paper and thought, “Wouldn’t it be nice if I could run an SQL query on this”?
Have you ever wondered if the animals in George Orwell’s Animal Farm could have benefited from AWS to categorise which animals were good or bad using only serverless components?
If you answered yes to at least one of these questions, then have I got a story for you.
TL;DR: Yes, it can be done! Draw table -> Scan it -> Use Amazon Textract to convert it to CSV -> Upload the CSV to S3 -> Run a Glue crawler -> Query the table using Athena -> Success.
Step 1 — Make some data
I drew this table as neatly as I could — probably should have used a ruler but ¯\_(ツ)_/¯.
So how can I convert this into a machine readable format? Enter Amazon Textract.
Step 2 — Convert to CSV
According to the documentation,
“Amazon Textract is a service that automatically detects and extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables”
This sounds perfect, but can it recognize handwriting? Let’s find out…
That actually worked — but TBH it took a bunch of attempts before I got it right. On the first iteration of my table, it read my handwritten GOOD as “GOOL” or “GOOI”, and it generally struggled with the way I usually write the number 2, thinking it was the letter d. Case in point…
That said, the character recognition of my cursive handwriting is actually pretty good — way better than I expected.
Once I got my handwriting in order, Textract recognized everything, and then I could export my beautiful table into a CSV.
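If you'd rather drive Textract from code than from the console, here's a rough sketch using boto3. The `analyze_document` call with `FeatureTypes=["TABLES"]` is the real API; the file names (`table.jpg`, `animals.csv`) are just placeholders, and the little parser below is my own assumption about how to flatten Textract's CELL/WORD blocks into rows — not an official helper.

```python
def table_rows(blocks):
    """Group Textract CELL blocks into rows of text via their child WORD blocks."""
    words = {b["Id"]: b["Text"] for b in blocks if b["BlockType"] == "WORD"}
    rows = {}
    for b in blocks:
        if b["BlockType"] != "CELL":
            continue
        # A CELL's Relationships point at the WORD blocks it contains.
        child_ids = [i for r in b.get("Relationships", [])
                     if r["Type"] == "CHILD" for i in r["Ids"]]
        text = " ".join(words[i] for i in child_ids if i in words)
        rows.setdefault(b["RowIndex"], {})[b["ColumnIndex"]] = text
    return [[row[c] for c in sorted(row)] for _, row in sorted(rows.items())]

# Uncomment with boto3 installed and AWS credentials configured
# (file names here are hypothetical):
# import csv, boto3
# textract = boto3.client("textract")
# with open("table.jpg", "rb") as f:
#     resp = textract.analyze_document(
#         Document={"Bytes": f.read()}, FeatureTypes=["TABLES"])
# with open("animals.csv", "w", newline="") as f:
#     csv.writer(f).writerows(table_rows(resp["Blocks"]))
```

The console's "export as CSV" button does the same job with zero code, which is what I used here.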
Step 3 — Upload to S3
Make an S3 bucket, upload the CSV to it.
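In boto3 terms, that's two calls — `create_bucket` and `upload_file` are the real API, while the bucket name and key here are hypothetical stand-ins for whatever you pick:

```python
BUCKET = "animal-morality-data"   # hypothetical bucket name
KEY = "animals/animals.csv"       # keep CSVs under a prefix for the crawler

def s3_path(bucket, key):
    """The S3 path the Glue crawler will point at later."""
    return f"s3://{bucket}/{key}"

# Uncomment with boto3 installed and AWS credentials configured:
# import boto3
# s3 = boto3.client("s3")
# s3.create_bucket(Bucket=BUCKET)
# s3.upload_file("animals.csv", BUCKET, KEY)

print(s3_path(BUCKET, KEY))  # prints s3://animal-morality-data/animals/animals.csv
```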
Step 4 — Run a Glue crawler
Now that we have our data in an S3 bucket, we need to analyse it and create a schema before we can run any queries. We could create this schema manually rather than using a Glue crawler, but that would defeat the purpose of this exercise, so here we go.
Below is the configuration of the Glue crawler I have set up. Basically what this is going to do is look through all the CSV files in my S3 bucket (I only have one), work out the schema of the data, and map it to a database called animal-morality.
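For the curious, the same crawler can be defined from code. `create_crawler` and `start_crawler` are the real boto3 calls; the crawler name, IAM role, and S3 path below are assumptions you'd swap for your own:

```python
import json

# Hypothetical names -- match these to your own account, role, and bucket:
crawler_config = {
    "Name": "animal-morality-crawler",
    "Role": "AWSGlueServiceRole-animals",   # an IAM role Glue can assume
    "DatabaseName": "animal-morality",      # the database the schema maps into
    "Targets": {"S3Targets": [{"Path": "s3://animal-morality-data/animals/"}]},
}

# Uncomment with boto3 installed and AWS credentials configured:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_config)
# glue.start_crawler(Name=crawler_config["Name"])

print(json.dumps(crawler_config, indent=2))
```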
Now let’s run it…
Success!
That’s actually quite impressive — we’ve now gone from data on a piece of paper, to a CSV, to now knowing the schema of the data, all using serverless components. Noice.
Step 5 — SQL query time using Athena
We’ve got this far, now it’s time to run some queries. Switching over to Amazon Athena, we can run SQL queries on our data using the newly created database and tables. When we run a query, it will scan the contents of the CSV file in our S3 bucket, using the schema that the Glue crawler identified.
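You can fire the same query from code too. `start_query_execution` is the real Athena API; the table and column names are hypothetical (whatever the crawler inferred from your CSV headers), and the output location is a placeholder:

```python
# Hypothetical table/column names -- use whatever the crawler inferred:
QUERY = """
SELECT animal, alignment
FROM "animal-morality"."animals"
WHERE alignment = 'GOOD'
"""

# Uncomment with boto3 installed and AWS credentials configured:
# import boto3
# athena = boto3.client("athena")
# resp = athena.start_query_execution(
#     QueryString=QUERY,
#     QueryExecutionContext={"Database": "animal-morality"},
#     ResultConfiguration={
#         "OutputLocation": "s3://animal-morality-data/athena-results/"},
# )
# execution_id = resp["QueryExecutionId"]  # poll this for results

print(QUERY.strip())
```

Note that Athena queries run asynchronously — from code you'd poll `get_query_execution` until the state is SUCCEEDED, whereas the console handles that wait for you.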
Step 6 — Success!
There you have it folks — turns out you can run an SQL query on a piece of paper. In the real world (or even in a futuristic, cloud-computing-assisted version of Animal Farm) you would of course never do this with a hand/trotter-drawn dataset, but the general theory and AWS services showcased could be used to query very large datasets from CSV, Parquet, etc., using only serverless components.