As an extension of the previous project and article on air pollution, and by popular request, I have done some extra work.
Whereas I wrote the previous piece as a megaphone for public awareness, this one is a simple technical article addressed to business intelligence / data analysts and professionals, describing the common steps and the skills / tools I typically use on these projects.
Previous article: "Air pollution: looking after my hometown". Exploiting public data and dataviz to monitor air pollution; an example of data democratisation to put a spotlight on…
I have to thank, first of all, the PeaceLink folks, who helped me get the data and are willing to carry on with citizen science and data democratisation to increase awareness of public health.
PeaceLink is a nonprofit organisation promoting solidarity, human rights and environmental protection. Decent human beings.
What follows is the result: a near real-time interactive dashboard monitoring the main pollutants measured across the whole Italian Puglia region, showing data as soon as it is provided (typically on an hourly basis) by the Arpa Puglia data systems.
You can embed it elsewhere, but you can also interact with it right here below.
How to build it
If you are curious and interested to know how I built it, please keep on reading; otherwise stop here and go do some good for this planet 😁
1. Find the data
The very first big problem: finding usable data. It is often the main hurdle, one that can discourage any good intent, but in this case the data is public, phew… no NDAs to sign. The only real difficulty was knowing the data existed, and Google helped with that: a simple search brought me to Arpa Puglia's public open data project.
2. Which format / API to query the data
At a glance I noticed the portal would let users download a CSV dump, a good start, but then I realised the solution is implemented with CKAN, “a tool for making open data websites. (Think of a content management system like WordPress — but for data, instead of pages and blog posts.)”. Great stuff.
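A nice consequence of the portal being CKAN is that it speaks CKAN's standard action API, so you can poke at it programmatically before committing to anything. Here is a minimal sketch (the helper names are mine; the base URL is the Arpa Puglia portal, and the `package_list` action is standard CKAN):

```python
import json

BASE = 'http://dati.arpa.puglia.it/api/action'

def ckan_action_url(action: str) -> str:
    # Every CKAN action lives under /api/action/<action name>
    return f'{BASE}/{action}'

def parse_action_response(raw: bytes):
    # CKAN wraps every response as {"success": bool, "result": ...}
    payload = json.loads(raw.decode())
    if not payload.get('success'):
        raise RuntimeError(payload.get('error'))
    return payload['result']

# Usage (network call, not run here):
#   import urllib.request
#   datasets = parse_action_response(
#       urllib.request.urlopen(ckan_action_url('package_list')).read())
```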
3. Get the data
Among the many ways provided to interrogate the data, as I'm a programmer and proud of it, I chose the quick and dirty approach: a Python script that connects to the CKAN Data API, downloads the data as JSON and transforms it into a table (a pandas DataFrame). I am not advocating this as the way to go, but if you are a programmer you can relate.
Here it is:
import datetime
import json
import urllib.request

import pandas as pd

today = datetime.datetime.today()
start = today - datetime.timedelta(days=1)
startDate = start.strftime("%Y-%m-%d")

link = ('http://dati.arpa.puglia.it/api/action/datastore_search_sql'
        '?sql=SELECT%20*%20from%20%22d08afe64-0285-4334-bc9f-92f5d95e106e%22'
        '%20WHERE%20Data_Rilevazione%20%3E%20%27' + startDate + '%27')

data = json.loads(urllib.request.urlopen(link).read().decode())
output_data = pd.DataFrame(data['result']['records'])
Every time the script runs, it hits the SQL query endpoint requesting the data for the last day; note the SELECT statement passed as a URL parameter.
The JSON response is parsed into a Python object, and a tabular DataFrame is built from its records.
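If the percent-encoded query above looks unreadable, you can write the SQL in plain text and let `urllib.parse.quote` do the encoding, which produces an equivalent URL and is easier to maintain:

```python
import datetime
import urllib.parse

start = datetime.datetime.today() - datetime.timedelta(days=1)
start_date = start.strftime('%Y-%m-%d')

# The same query as above, written out before percent-encoding
sql = (
    'SELECT * from "d08afe64-0285-4334-bc9f-92f5d95e106e" '
    f"WHERE Data_Rilevazione > '{start_date}'"
)
link = ('http://dati.arpa.puglia.it/api/action/datastore_search_sql?sql='
        + urllib.parse.quote(sql))
print(link)
```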
4. Data preparation
At this point I switched to a tool (Omniscope) into which I copy-pasted the Python script as a block of a data workflow. This way I can pull the data, clean it up and join the pollution measurements with other lookup tables all in the same environment, mixing Python scripting with pre-built native data input and preparation blocks / routines.
This is what the data workflow looks like (you can interact to explore block options and data, but not edit it…)
N.B. If you are not a developer you could drop the Python script and use the workflow File Input block to pull and parse the JSON data from the online endpoint, but I am so proud of that script that I kept it 😎
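To give a feel for what the preparation blocks do, here is a rough pandas equivalent: coerce the measurement values to numbers and join them with a station lookup table. The column names and sample values below are hypothetical, just to illustrate the shape of the operations:

```python
import pandas as pd

# Hypothetical raw slice in the shape returned by the datastore API
raw = pd.DataFrame({
    'id_station': [1, 1, 2],
    'inquinante': ['PM10', 'NO2', 'PM10'],
    'valore': ['38.0', '21.5', 'n.d.'],  # values arrive as strings; 'n.d.' = missing
})

# Hypothetical lookup table with station metadata
stations = pd.DataFrame({
    'id_station': [1, 2],
    'comune': ['Taranto', 'Bari'],
})

# Clean: coerce measurements to numbers, dropping non-numeric markers
raw['valore'] = pd.to_numeric(raw['valore'], errors='coerce')
raw = raw.dropna(subset=['valore'])

# Join measurements with the station lookup
prepared = raw.merge(stations, on='id_station', how='left')
print(prepared)
```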
5. Data visualisation
“Without data visualisation data democratisation wouldn’t be possible” ¹
At the end of the workflow there is a blue report block: the interactive dashboard querying the prepared, streamed data. On it I added the charts that made sense for the case, a bar / line chart, box plot, pivot table and map, plus filters that let users view and dig into the 24-hour observations and trends.
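The dashboard itself is built in the tool, but if you wanted a quick sanity check of the numbers in Python before designing the charts, the aggregates behind a bar chart are one `groupby` away (sample data below is made up):

```python
import pandas as pd

# Hypothetical sample in the shape of the prepared data
df = pd.DataFrame({
    'station': ['Taranto', 'Taranto', 'Bari', 'Bari'],
    'pollutant': ['PM10', 'NO2', 'PM10', 'NO2'],
    'value': [38.0, 21.5, 24.0, 18.2],
})

# Average concentration per pollutant: the aggregate behind a bar chart
summary = df.groupby('pollutant')['value'].mean()
print(summary)
# then e.g. summary.plot.bar() for a quick visual check
```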
6. Automation / data refresh
The final touch is using the app scheduler tool to automate pulling the data from the provider and refreshing the dashboard. I'll spare you the details: it is just a matter of setting up, through the app web UI, how often to execute a block, an action or a workflow. All automated, no scripting required.
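The scheduler needs no code, but conceptually each refresh fires at the next boundary of the provider's hourly cadence. A tiny sketch of that logic (the helper name is mine, purely illustrative):

```python
import datetime

def next_hourly_run(now: datetime.datetime) -> datetime.datetime:
    # Next top-of-the-hour refresh, matching the provider's hourly cadence
    floored = now.replace(minute=0, second=0, microsecond=0)
    return floored + datetime.timedelta(hours=1)

print(next_hourly_run(datetime.datetime(2021, 5, 1, 10, 15)))
```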
That’s it: simply the 6 common steps I always find myself going through when preparing BI / data analytics projects, starting from any kind of raw data, through ETL and analytics, to visualisation.
Drop me any feedback. A presto!
¹ sᴀɪᴅ ɴᴏ ᴏɴᴇ ᴇᴠᴇʀ