Analyzing Air Quality Data using Python Pandas and Plotly
An adult breathes about 15,000 liters of air every day, so air quality is a very important factor in our lives. If this air is polluted, it gets into our lungs and bloodstream, and it can even be carried to our brains, causing severe health problems. Air quality also affects the natural environment: air pollution causes acidification and eutrophication, which reduce agricultural productivity and do irreversible damage to ecosystems. Given the importance of air quality, we decided to write an introductory blog post on how to analyze air quality data. To follow along with the analysis, you can use the Jupyter Notebook example shared in our GitHub repository; the code snippets below are taken from it.
Getting Started
In this example, we use Python 3. To follow along, you need the pandas, requests, numpy, plotly and matplotlib modules.
Installing these packages is straightforward, and if you already use Python, you probably have at least half of them installed.
Also, if you are not a Planet OS Datahub user yet, please make sure to sign up.
In the Notebook you will be able to …
- find which observational air quality datasets are available in Datahub using Search & Discovery Endpoints;
- see all of the stations on the map from the EPA AirNow hourly and EEA Air Quality datasets;
- find available variables from each dataset;
- query data from a single station, find daily values with Pandas and visualize using Plotly;
- query data from a bigger area, find daily and monthly values, visualize them;
- find maximum values and stations;
- analyze stations with maximum values from the current year.
Overview
First, we search for in situ air quality datasets in Datahub. The search returns four datasets: one covering Europe, two covering the US, and one single-station dataset from Mauna Loa, Hawaii, US.
European Environment Agency Air Quality Dataset,
The U.S. Environmental Protection Agency’s (EPA) Air Quality,
The U.S. Environmental Protection Agency’s (EPA) Hourly Data,
Weekly mean carbon dioxide measured at Mauna Loa Observatory, Hawaii
In this example, we have decided to use the European Environment Agency (EEA) Air Quality Dataset and The U.S. Environmental Protection Agency’s (EPA) Hourly Data.
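The dataset search itself goes through the Datahub REST API with the requests module. The endpoint path and parameter names in the sketch below ("/v1/search", "q", "apikey") are illustrative assumptions rather than the exact API; it only builds the query URL so you can see the shape of the call, and the actual request is left commented out:

```python
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder; use your Datahub key
BASE_URL = "http://api.planetos.com/v1/search"  # assumed endpoint path

def build_search_url(query, api_key=API_KEY):
    """Build a dataset-search URL; parameter names are assumptions."""
    return BASE_URL + "?" + urlencode({"q": query, "apikey": api_key})

url = build_search_url("air quality")
# response = requests.get(url).json()  # perform the actual request
print(url)
```

Check the Datahub API documentation for the exact endpoint and parameters before running a real query.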
The coverage of these two datasets is shown on a map. Below is an example of how to make a map using Plotly.
fig = go.Figure()
fig.add_trace(go.Scattermapbox(
    lat=all_epa_stations.latitude, lon=all_epa_stations.longitude,
    mode='markers',
    marker=go.scattermapbox.Marker(color='#EC5840', size=4),
    text=all_epa_stations.station,
    hoverinfo='text'
))
fig.add_trace(go.Scattermapbox(
    lat=all_eea_stations.latitude, lon=all_eea_stations.longitude,
    mode='markers',
    marker=go.scattermapbox.Marker(color='#4E2F90', size=4),
    text=all_eea_stations.station,
    hoverinfo='text'
))
fig.update_layout(mapbox_style="open-street-map", autosize=True, showlegend=False,
                  height=500, margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.show()
Once we have queried the data, we start using Pandas to analyze it. In this instance, we visualize daily and monthly data over the past 6 years. For this, we use Pandas resample:
station_data = station_data.set_index('time')
daily_station = station_data.resample('1D').mean()
For monthly data, we easily change the command:
station_data = station_data.set_index('time')
monthly_station = station_data.resample('1M').mean()
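To make the resampling step concrete, here is a tiny self-contained sketch on synthetic hourly data (the values are made up purely for illustration):

```python
import pandas as pd
import numpy as np

# Synthetic hourly PM2.5 readings covering two days (made-up values)
times = pd.date_range("2021-01-01", periods=48, freq="H")
station_data = pd.DataFrame({"time": times,
                             "PM2.5": np.concatenate([np.full(24, 10.0),
                                                      np.full(24, 20.0)])})

# Same pattern as above: index by time, then resample to daily means
station_data = station_data.set_index("time")
daily_station = station_data.resample("1D").mean()

print(daily_station)
# Day 1 mean is 10.0, day 2 mean is 20.0
```

Swapping '1D' for '1M' gives the monthly version, exactly as in the snippet above.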
As we can see below, daily data looks pretty busy on a plot; however, Plotly is a powerful tool, and you can easily zoom in to take a closer look at the timeframe you are interested in. In addition, you can plot multiple variables on a single graph:
fig = go.Figure()
for variable in variables:
    fig.add_trace(go.Scatter(x=daily_station.index, y=daily_station[variable],
                             mode='lines', name=variable))
fig.show()
In addition to single-station data, we can work with multiple stations at once. Comparing stations lets us observe how facilities like factories or airports can affect a city’s air quality.
Pandas groupby can be combined with resample for this. In this example, we group the data by station and find the mean values. Using the code sample below, we can find the monthly mean PM2.5 data for each station.
ny_monthly_station_means = ny_data.groupby('station')['PM2.5'].resample('1M').mean()
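As a minimal, self-contained illustration of the groupby/resample combination (station names and readings below are invented):

```python
import pandas as pd

# Two fictional stations with made-up readings over two months
idx = pd.to_datetime(["2021-01-10", "2021-01-20", "2021-02-10", "2021-02-20"] * 2)
ny_data = pd.DataFrame({"station": ["A"] * 4 + ["B"] * 4,
                        "PM2.5": [10.0, 20.0, 30.0, 40.0, 1.0, 3.0, 5.0, 7.0]},
                       index=idx)

# Same pattern as above: per-station monthly means, indexed by (station, month)
ny_monthly_station_means = ny_data.groupby("station")["PM2.5"].resample("1M").mean()
print(ny_monthly_station_means)
# Station A: Jan mean 15.0, Feb mean 35.0; station B: Jan mean 2.0, Feb mean 6.0
```

The result is a Series with a (station, time) MultiIndex, which is what the later station-selection steps rely on.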
Below you can see the visualized version of this data. You can exclude stations if you want, and also zoom in and out. In this example, we can see that PM2.5 values get higher in the winter months, especially around New Year’s Eve.
Now, we will look at how to visualize daily historic multi station data. For this, we use a violin plot. A violin plot helps us to visualize the distribution of the data and its probability density. With Plotly, we can do this without much effort.
fig = go.Figure()
for stat in np.unique(berlin_data['station']):
    station_daily_mean_data = pd.DataFrame({'PM2.5': berlin_daily_station_means[stat].values,
                                            'time': berlin_daily_station_means[stat].index})
    station_daily_mean_data['year'] = pd.DatetimeIndex(station_daily_mean_data['time']).year
    fig.add_trace(go.Violin(x=station_daily_mean_data['year'],
                            y=station_daily_mean_data['PM2.5'],
                            legendgroup=stat, scalegroup=stat, name=stat))
fig.update_traces(box_visible=True, meanline_visible=True)
fig.update_layout(violinmode='group', width=900, height=450)
fig.show()
In the example below, we can see that PM2.5 values vary the most at station STA.DE.DEBE065. In general, the stations have relatively similar distributions.
Next we visualize the stations around Berlin and New York that have the highest maximum daily values. As the graph is pretty busy, we can zoom in to get a clearer picture. We can also see how the values peak on New Year’s Eve.
fig = go.Figure()
ny_max_station = ny_daily_station_means.idxmax()[0]
berlin_max_station = berlin_daily_station_means.idxmax()[0]
ny_max_station_mean_data = pd.DataFrame({'PM2.5': ny_daily_station_means[ny_max_station].values,
                                         'time': ny_daily_station_means[ny_max_station].index})
berlin_max_station_mean_data = pd.DataFrame({'PM2.5': berlin_daily_station_means[berlin_max_station].values,
                                             'time': berlin_daily_station_means[berlin_max_station].index})
fig.add_trace(go.Scatter(x=ny_max_station_mean_data.time, y=ny_max_station_mean_data['PM2.5'],
                         mode='lines', name=ny_max_station, marker_color=next(palette)))
fig.add_trace(go.Scatter(x=berlin_max_station_mean_data.time, y=berlin_max_station_mean_data['PM2.5'],
                         mode='lines', name=berlin_max_station, marker_color=next(palette)))
fig.show()
We can see that the maximum in New York was on 2020-12-06. In Berlin, the maximum was on 2018-12-31, which suggests that there were fireworks going off in Berlin on New Year’s Eve!
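The idxmax trick used above, pulling a station name out of a MultiIndexed series, is easier to see on a toy example (station names and values below are made up):

```python
import pandas as pd

# Daily per-station means indexed by (station, time), as groupby+resample produces
idx = pd.MultiIndex.from_tuples(
    [("STA1", pd.Timestamp("2021-01-01")), ("STA1", pd.Timestamp("2021-01-02")),
     ("STA2", pd.Timestamp("2021-01-01")), ("STA2", pd.Timestamp("2021-01-02"))],
    names=["station", "time"])
daily_station_means = pd.Series([5.0, 9.0, 12.0, 3.0], index=idx, name="PM2.5")

# idxmax returns the (station, time) pair of the overall maximum; [0] is the station
max_station = daily_station_means.idxmax()[0]
print(max_station)                        # station holding the overall maximum
print(daily_station_means[max_station])   # that station's full daily series
```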
In addition, we show how to keep only the data from the start of last year onwards using Pandas.
last_years_ny_data = ny_data[ny_data.index > datetime.datetime.today().replace(
    year=datetime.datetime.today().year - 1, month=1, day=1, hour=0)]
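The same filter is easier to verify on fixed dates. A minimal sketch with invented data, using the fact that pandas accepts ISO date strings in comparisons against a DatetimeIndex:

```python
import pandas as pd

# Made-up daily PM2.5 readings straddling a year boundary
idx = pd.date_range("2019-12-30", "2020-01-02", freq="D")
ny_data = pd.DataFrame({"PM2.5": [8.0, 35.0, 60.0, 12.0]}, index=idx)

# Keep everything from the start of 2020 onwards
last_years_ny_data = ny_data[ny_data.index >= "2020-01-01"]
print(last_years_ny_data)
# Only the 2020-01-01 and 2020-01-02 rows remain
```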
Conclusion
In conclusion, Pandas and Plotly are powerful tools for analyzing data. In this example, we focused on visualizing time series datasets with multiple stations and found maximum and mean values. We also resampled data to daily and monthly means for better visualization. Our analyses were simple but effective in showing how to work with air quality data. As environmental issues become more pressing, extracting insights from this kind of data will become more and more important. If there are any interesting analyses that you would like to run or any more specific data that you would like to see, please don’t hesitate to contact us!