How in the world do you access air quality data older than 90 days on the OpenAQ platform? One way is to use Amazon Athena.

This post is written by Heidi Yoon, OpenAQ’s Community Engagement Officer, which is based on a GitHub gist by our co-founder, Joe Flasher, and the User Guide for Amazon Athena.

If you would like to access air quality data from the last 3 months, then you can access the data from the OpenAQ platform using our API and other tools available through our website!

If you would like to access air quality data older than 90 days, then you can access all of the data from S3 buckets, which are cloud data storage managed by Amazon Web Services (AWS).

As a historical note, this was a change that occurred in late 2017 (see our blog post here!). At this point, the OpenAQ platform began housing over 100 million data points in parallel, in S3 buckets and in a database accessed by the API. Once we were managing over 100 million data points, the API performance became slower and the financial costs increased to maintain the database. So, we decided that the bulk of the data would continue to be stored in S3 buckets, while only the last 90 days would be available through the API.

Anyone can access our S3 buckets.

To query the S3 buckets similar in ways that you may have done before with the OpenAQ API, you will need a distributed query tool like Amazon Athena, Apache Spark, or Google BigQuery.

In this blog post, I’ll describe in detail how to use Amazon Athena to query our data and save your results.

As an aside, we are also working to find other ways to make the historical data more accessible. If you have ideas, please contact us! (via info@openaq.org, GitHub, or Slack.)

Onto ATHENA!

Amazon Athena is a query service that you can use to analyze data in S3 buckets. To use Athena, you will need an AWS account. The basic AWS account is free, however you will need to pay for the Athena queries you run. The cost of a query depends on the size of your dataset, and for our current OpenAQ dataset, this will cost about $0.01 (USD) per query.

Using the AWS Management Console

Here, I’ll describe how to access Athena using the AWS Management Console, but if you are extra savvy, then there are other ways that you can explore how to access Athena, like with the CLI, JDBC, and their API (user guide). Once you have an AWS account, you can easily log in to the console on any AWS web page in the upper right hand corner. Once you’ve logged in, you can access Athena by typing it in the given box or selecting Athena, under Analytics.

If this is the first time you are opening Athena, you will go to a Getting Started page. Choose Get Started. The tutorial launches automatically. Feel free to take the tutorial or close it. You can always run the tutorial later, if you wish, by clicking on Tutorial in the upper right hand corner.

Selecting Your Access Region

Before we run anything with Athena, please check your access region, and change your region name to US East (N. Virginia) in the upper right hand corner of the console. You can do this no matter where you live in the world! By choosing a region, you are choosing where Athena should run its query. By running the queries in US East (N. Virginia), your queries will be completed faster and save OpenAQ money on data transfer.

Creating a Table for the OpenAQ Dataset

The first step is to create a table for the OpenAQ dataset. This table tells Athena where the OpenAQ data exists in S3 and specifies the data structure of the dataset (ie. column names, data types, etc.) You could write the table yourself, or you can use the table definition in the openaq.ddl file, given at the bottom of Joe’s gist. Copy and paste the entire table definition into the Query window for a new query and click on Run query. If the table loaded successfully, you will see the message, Query Successful, in the Results window below. You should now see the OpenAQ table on the left panel under Tables. To view sample rows of the OpenAQ table, you can choose Preview Data by clicking on the vertical three dots next to the table name in the left panel.

Running Queries Using Athena

Queries in Athena are written using standard SQL. Using a New query window, you can run any query that you wish! Once the query is completed, the results will populate in the window below.

Here are a few, example queries.

-We can query all the data for a particular location (or city or country), like Manama, the capital of Bahrain, using the query below.

-Or, we could conduct a more specific query. Let’s say, we only needed the date and PM10 values for the station named Otoka in Sarajevo, Bosnia and Herzegovina. Then, the query would be as follows.

Joe has some more examples of sample queries in his GitHub gist here! By using Athena or any other distributed query tool, you should be able to easily access any and all of the air quality data that you wish from the OpenAQ platform.

Saving Queries and Results

Once you run a successful query, it’s easy to save your results immediately as a CSV file. For example, we can look at a screenshot of the Results window of the Manama query from above.

By clicking on the save icon in the upper corner of the Results window, you can easily export the data as a CSV file. If you don’t save your results immediately, you can also export as a CSV file later by accessing the History tab.

Athena automatically retains queries and their results for 45 days, and you can view all your recent queries using the History tab or in your unsaved S3 buckets assigned to your AWS account.

We hope this helps unravel some of the mystery of Amazon Athena and that you can access all of the OpenAQ data for your air inequality work! As we said above, we are working to find other ways to make the historical data more accessible. Please let us know if you have ideas! (via info@openaq.org, GitHub, or Slack.)