
An Introduction to Data Collection: REST APIs with Python & Pizzas

Thomas Kidd, Johnathan Padilla and Hadrien Picq are fellows in the TechSoup and ParsonsTKO Summer 2020 Data Strategy Mentorship Program, a select group of upcoming data analysts and scientists learning from industry professionals about what the social impact and non-profit analytics sectors look like.

To highlight this week’s theme on data collection, the team sought regulatory air quality data from various sources (specifically, raw measurements of fine particles: PM2.5). Each fellow explored and detailed the process of extracting data from the following sources:

Using the OpenAQ API: A Walkthrough

OpenAQ is an open-source platform that continuously aggregates air quality data from regulatory sources all over the world. We’ll illustrate how to pull data using their public-facing API, as well as the py-openaq Python wrapper.

What is an API?

‘API’ stands for ‘Application Programming Interface’. In the simplest terms, an API is “an interface that lets the program you write control or access a program somebody else wrote”(1).

You can also think of it this way: writing a program on your home computer is like cooking in your own kitchen. You have ingredients that you own, and with your culinary skills you’re able to cook yourself a fine meal. But let’s say you lack the ingredients, kitchenware, knowledge, or even patience to make, say, a pizza. In that case you might want to head down to the local pizzeria, where you can request the toppings and ingredients of your choice. Like an API, you can’t cook the pizza yourself, but, depending on the pizzeria, you can get the type of pizza you desire (2). Think of using an API as a transaction, sometimes free of charge, but with certain conditions and limitations.

FIRST THINGS FIRST: Read the API Documentation

A good practice when working with APIs is to start by reading the documentation:

This is important to understand which parameters can be passed to the API (whether you can call data by location, interval of time, attribute type, etc.) and, importantly, to be aware of the API’s limits and restrictions. For example, OpenAQ has a limit of 2,000 requests over a 5-minute period. In other words, you can’t make more than 2,000 requests in under 5 minutes. APIs have rate limits to avoid being overwhelmed by requests, and as a protection against ‘bad actors’ seeking to overwhelm the system.
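As a rough sketch (this is not from the OpenAQ docs, just one common way to respect such a limit), you can space out your requests so that a batch never exceeds 2,000 calls per 5-minute window:

```python
import time

# Numbers from the rate limit described above
MAX_REQUESTS = 2000
WINDOW_SECONDS = 5 * 60

# Sleeping this long between calls keeps us under the limit
delay = WINDOW_SECONDS / MAX_REQUESTS  # 0.15 seconds per request

for url in []:  # replace [] with your own list of query URLs
    # requests.get(url)  # the actual call would go here
    time.sleep(delay)
```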

Furthermore, APIs can also limit the scope of the data you may access from the source. For example, OpenAQ’s API is limited to a rolling archive of the last 90 days’ worth of data. So if you wanted data from a year ago, that would not be possible with this API.

Licensing terms are also something you might want to be aware of, depending on the intended use of the data. The OpenAQ API’s data is licensed under the Creative Commons “Attribution 4.0 International” license, which grants authorization for the data to be transformed and used for commercial purposes.

Getting started with OpenAQ in Python

Import the package dependencies

In a Jupyter Notebook or IDE of your choice, install and import the following packages:

import pandas as pd
import json
import requests

Define the scope of your query

Since the API is limited to the archive of the last 90 days, my objective will be to request measurements of fine particulates (PM2.5) for the city of San Francisco, California, for the month of May 2020 (as of June 13th, 2020).

The API we’re working with is a ‘REpresentational State Transfer’ (REST) API (3). With a REST API, you pass your request via HTTP commands against structured URLs. To borrow the previous pizza metaphor: the chef doesn’t want to hear from you directly, she’s too busy. Instead, she will direct you to a structured menu with a list of options, asking how many toppings you desire, the kind of crust you prefer, the preferred type of cheese etc.

Think of the base URL below as the bare dough which will serve as the base to your desired sauce and toppings.

You will find the base URL in the documentation.

base_url = ""

Add your toppings!

I have three variables of interest:

a. A location;
b. An interval of time;
c. A specific feature attribute.

The documentation will allow me to identify which arguments (ingredients!) I need to pass into the base URL to define my query. From the documentation, they are identified as:

a. “location”

Be careful, as cities around the world can share the same name. Think carefully about possible caveats in your query (sometimes that’s just something you learn from observing your outputs).

city_name = "San Francisco"

b. “date_from” & “date_to”

The documentation informs me that the timestamps in the database are in UTC, thus I will need to convert my interval from PST to UTC. There are several online tools to perform the conversion.

# Note that the timestamp in UTC is written in military time (24-hours)
date_start = "2020-05-01T08:00:00" # 12:00 am PST is 8:00 am in UTC
date_end = "2020-05-31T08:00:00"
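You can also do the conversion in Python itself rather than with an online tool. This sketch uses a fixed PST offset (UTC-8), matching the timestamps above; note that in May, California is actually on PDT (UTC-7), so a strict wall-clock conversion would differ by one hour:

```python
from datetime import datetime, timezone, timedelta

# A fixed PST offset (UTC-8), as assumed by the timestamps above
pst = timezone(timedelta(hours=-8), name="PST")

# Midnight local time on May 1st, 2020
local_midnight = datetime(2020, 5, 1, 0, 0, tzinfo=pst)

# Convert to UTC and format in the API's expected style
utc_equivalent = local_midnight.astimezone(timezone.utc)
print(utc_equivalent.strftime("%Y-%m-%dT%H:%M:%S"))  # 2020-05-01T08:00:00
```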

c. “parameter”

I want measurements for fine particles, thus the value passed to the parameter argument will be pm25.

parameter = "pm25"

Finally, we set all of our variables into the query URL

query_url = base_url + "location=" + city_name + "&date_from=" + date_start + "&date_to=" + date_end + "&parameter=" + parameter + "&limit=10000"

Note that the fixed parts of the query appear in quotation marks inside the concatenation; those are string literals that never change, while the unquoted names are the variables holding our values. This is what the query looks like once it’s built:

print(query_url)
Francisco&date_from=2020-05-01T08:00:00&date_to=2020-05-31T08:00:00&parameter=pm25&limit=10000

Pasting this URL into a browser will actually return the entire requested record set as JSON. Copying and pasting the results into a text editor is one way of accessing the data. Notice also the limit argument, which I assigned the maximum value of 10,000, because the default would otherwise be 100 (which is nowhere near enough to collect the data we seek).

You might also say “Mamma mia! The URL has a blank space! Will it cause my query to break?”. Thankfully, HTTP requests percent-encode blank spaces as %20.
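If you’d rather not worry about encoding (or string concatenation) at all, Python’s standard library can build the query string for you. This is an alternative sketch, not how the article’s query was built:

```python
from urllib.parse import urlencode, quote

# The same arguments as above, as a dictionary
params = {
    "location": "San Francisco",
    "date_from": "2020-05-01T08:00:00",
    "date_to": "2020-05-31T08:00:00",
    "parameter": "pm25",
    "limit": 10000,
}

# urlencode joins and percent-encodes everything;
# quote_via=quote turns the blank space into %20
query_string = urlencode(params, quote_via=quote)
print(query_string)  # contains San%20Francisco
```

The full URL would then be the base URL from the documentation plus this query string.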

We use the GET HTTP command to retrieve data. APIs are not just for getting data: a database administrator can use HTTP commands such as POST to create data or PUT to update it.

Think of GET as being a customer in the pizzeria, as opposed to a food inspector.

results_jsons = requests.get(query_url).json()

We append .json() to requests.get(), something called method chaining (4), to parse and store the output as JSON.
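To see what method chaining buys us without making a network call, here is the same two-step-versus-chained pattern illustrated with the standard json module (a stand-in example, not part of the query itself):

```python
import json

# A small JSON string shaped like an API response body
payload = '{"results": [{"value": 7.0}]}'

# Two steps: create a decoder object, then call its decode() method
decoder = json.JSONDecoder()
data = decoder.decode(payload)

# One chained expression, just like requests.get(query_url).json()
data_chained = json.JSONDecoder().decode(payload)

print(data == data_chained)  # True
```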

Why do we want to store data as a JavaScript Object Notation (JSON)? Well, if we’re calling a large amount of data, JSON takes less space, is faster for a search query to parse through, and, contrary to how things may look below, is actually quite readable (once you pay attention to the structure of the file).

Contrary to appearances, JSON is a quite readable file format

But OK, I hear you. This is giving you a headache, like anchovies on a pizza. Plus there’s some info under the ‘meta’ tag that we don’t really care about in the context of storing data inside a pandas dataframe.

To remedy this, we use something called a list comprehension: a ‘for’ loop written compactly inside square brackets, so that each element of interest is iteratively stored inside a list.

Remember that bit about the structure of a JSON file? Well, if we look at the JSON carefully, we’ll notice that there are two high-level tags: ‘meta’, which we want to exclude, and ‘results’, which contains all of the ingredients we seek. Each individual record is stored inside curly braces {}, and we want to store each record as an individual row.

results_list = [results_json for results_json in results_jsons['results']]
print(results_list)
Still carries the appearance of a mess, but now we’ve truly isolated all of the attributes and records we’re interested in.
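If the bracket syntax is new to you, the list comprehension above is just shorthand for an explicit loop. This self-contained sketch uses a hypothetical miniature response shaped like the API’s ‘meta’/‘results’ structure:

```python
# Hypothetical sample shaped like the API response
results_jsons = {
    "meta": {"name": "openaq-api", "license": "CC BY 4.0"},
    "results": [
        {"location": "San Francisco", "parameter": "pm25", "value": 7.0},
        {"location": "San Francisco", "parameter": "pm25", "value": 9.0},
    ],
}

# The comprehension from the article...
results_list = [r for r in results_jsons["results"]]

# ...is equivalent to this explicit loop:
results_loop = []
for r in results_jsons["results"]:
    results_loop.append(r)

print(results_list == results_loop)  # True
```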

Using pandas’s DataFrame() function, we can easily convert the list to a dataframe, and even filter attributes further if we so wish via the columns argument.

results_df = pd.DataFrame(results_list, columns=['location', 'parameter', 'date', 'value', 'unit', 'coordinates', 'country', 'city'])
print("Our dataframe has this many rows: " + str(len(results_df)))
results_df.head()
Well, obviously, some work needs to be done (looking at the date and coordinates fields); but that’s a story for another time.

Don’t forget to save your work, preferably into a csv file. In other words, store that slice in the fridge!
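A minimal sketch of that save step, using pandas’s to_csv() with a hypothetical filename and a tiny stand-in dataframe (swap in your own results_df):

```python
import pandas as pd

# Stand-in for the results_df built above, so the snippet runs on its own
results_df = pd.DataFrame(
    [{"location": "San Francisco", "parameter": "pm25", "value": 7.0}]
)

# index=False keeps pandas's row index out of the file
results_df.to_csv("openaq_sf_may2020.csv", index=False)
```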

Pizza time!

The Jupyter Notebook of this tutorial can be accessed here & the code repository is publicly available on GitHub.

An Alternative: Using the Python wrapper for the Open AQ API

The py-openaq Python Wrapper for OpenAQ was programmed by David Hagan. Whenever you use a programmatic package in research or publication, please credit the package author(s).

What’s a Python wrapper?

It’s essentially a Python function that simplifies or streamlines more complicated functions. It’s like ordering your pizza online instead of walking all the way to the pizzeria.
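As a toy illustration of the idea (this is not py-openaq itself), a wrapper hides the string concatenation from the earlier section behind one friendly function:

```python
# A hypothetical wrapper: callers pass tidy arguments and never see the URL plumbing
def measurements_query(base_url, location, parameter, date_from, date_to, limit=10000):
    return (base_url + "location=" + location + "&date_from=" + date_from
            + "&date_to=" + date_to + "&parameter=" + parameter
            + "&limit=" + str(limit))

url = measurements_query("", "San Francisco", "pm25",
                         "2020-05-01T08:00:00", "2020-05-31T08:00:00")
print("parameter=pm25" in url)  # True
```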

Install the wrapper in your programming environment

pip install py-openaq

As usual, check the Documentation:

Import Dependencies

import openaq
print("openaq v{}".format(openaq.__version__))

openaq v1.1.0

Define your variables

location = "San Francisco"
date_from = "2020-05-01T08:00:00"
date_to = "2020-05-31T08:00:00"
parameter = "pm25"

Initiate an instance of the openaq.OpenAQ class

api = openaq.OpenAQ()

Run the wrapper

results = api.measurements(location=location, parameter=parameter, date_from=date_from, date_to=date_to, limit=10000, df=True, index='local')
results.head()
The outputs from the wrapper are also much cleaner than those from a raw GET query. Hence, it pays to read the documentation!

Export your outputs again to a csv file:
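A minimal sketch, assuming `results` is the dataframe returned by api.measurements() above; a hypothetical filename and a tiny stand-in row are used here so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for the dataframe returned by the wrapper
results = pd.DataFrame([{"parameter": "pm25", "value": 7.0}])

# Write it out; index=False drops the row index column
results.to_csv("openaq_wrapper_output.csv", index=False)
```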


Neato Mosquito! At this stage, you will get ready to “wrangle” (i.e., fix and format) your datasets. Until next time!

Aspiring data analyst with a background in GIS. Finishing a Master’s in Environmental Assessment on participatory air monitoring and Citizen Science.
