How to collect data like a spy — Part 4

Turning JSON into a database with AWS Athena

So, we have our data in our S3 bucket, which contains 1,000s of JSON files with the details of tweets with geo locations, how do we analyse, search and process this data?

AWS Athena

Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you only pay for the queries that you run.

It’s designed to support very large datasets of CSV, JSON, ORC, Apache Parquet and Avroand files and is Presto with an Amazon wrapper.

You can also use Athena on encrypted buckets, more here.

Configuring Athena

We need to create a database in Athena with the following configuration. We have called the database Twitter and the table Geo.

We tell Athena that the bucket contains JSON files.

The names of the columns refer to the data within the json file, so;

created_at = string
source = string
text = string

For the geo data, as this is an array, we need to use the following;

geostruct<coordinates:array<float>>

We have now configured Athena. We can now start query this with R and rStudio (Part Five).

The Series

Part One — How to collect data like a spy
Part Two — Getting NiFi up and running
Part Three — How to collect social media data like a pro
Part Four — Creating a database with AWS Athena
Part Five — Connecting RStudio to Athena
Part Six — Creating Maps of the Data in RStudio
Part Seven — Creating an interactive dashboard for your data

Show your support

Clapping shows how much you appreciated Mark Craddock’s story.