Visualizing Air Quality Data
With ggplot2, gganimate, dygraphs (R) and folium (Python)
I wrote a post on visualizing PM 2.5 concentration before. With KDD Cup 2018 (KDD Cup of Fresh Air) coming up, I figured it was a good time to write an updated version.
This will be a miscellaneous collection of visualizations, external resources, and random thoughts (I wouldn’t call them analyses) based on the KDD Cup dataset. Although the competition rules do not seem to explicitly forbid public sharing of the dataset, I’d prefer to play it safe and ask you to download the dataset from its website if you want to run the code locally.
Missing Data (ggplot2)
The dataset includes hourly observations from 35 stations in Beijing and 13 stations in London (plus some London stations that are not included in the prediction set), spanning Jan. 2017 to Mar. 2018. New data from Apr. 2018 onward is available via an official API.
Take stations in Beijing as an example:
# Plot only the rows flagged as missing, one panel per station
plot_na <- ggplot(
  bj_aq_full[PM25_NA == 1],
  aes(x=time, y=PM25_NA +
        runif(sum(bj_aq_full$PM25_NA == 1), -0.1, 0.1))) +
  facet_wrap(~station_id, ncol=1) +
  geom_point(size=0.1, alpha=0.5) +
  labs(y="", title="Missing Data Points for Each Station") +
  theme_bw() + theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank()
  )
I added some jitter using runif(sum(bj_aq_full$PM25_NA==1), -0.1, 0.1), so we can see the density of missing points more clearly. There are quite a few missing data points across the board, with some stations missing entire segments (like the zhiwuyuan station in the last row).
It can be helpful to also include valid data points in the plot:
# Plot both present (0) and absent (1) points so every row is accounted for
plot_na <- ggplot(
  bj_aq_full,
  aes(x=time, y=PM25_NA +
        runif(nrow(bj_aq_full), -0.3, 0.3))) +
  facet_wrap(~station_id, ncol=1) +
  geom_point(size=0.1, alpha=0.5) +
  labs(y="", title="Missing Data for Each Station (0: Present, 1: Absent)") +
  theme_bw() + theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )
This is to make sure all data points are accounted for. Through an earlier version of this plot I found that some segments, such as early Jul. 2017 and May 2017, are missing for all stations. Therefore we have to fill in the timestamps ourselves (this is especially important if you want to do sequence modeling later):
# Build the full hour-by-station grid, then left-join the observations onto it
# (ymd_hms() comes from the lubridate package)
full_keys <- expand.grid(
  time=seq(ymd_hms("2017-01-01 16:00:00"),
           ymd_hms("2018-03-30 16:00:00"),
           by="1 hour"),
  station_id=unique(bj_aq[, station_id])
)
bj_aq_full <- merge(bj_aq, full_keys, by=c("time", "station_id"), all.y=TRUE)
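For reference, the PM25_NA indicator used in the plots above is simply a missing-value flag on this merged table. A minimal sketch, assuming the raw reading column is named PM25_Concentration as in the later snippets:
# Rows created by the merge have no reading, so flag them as missing (1) vs present (0)
bj_aq_full[, PM25_NA := as.integer(is.na(PM25_Concentration))]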
The missing data can make forecasting difficult because you don’t have a reliable source of history. One way to alleviate this problem is cross-referencing data points from different stations, but that won’t help when an entire segment is missing for all stations, in which case yearly correlation cannot be used either.
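As a rough illustration of the cross-referencing idea (a hypothetical sketch, not the method used in the competition), one could borrow the average of whatever other stations reported in the same hour whenever a single station’s reading is missing:
# Hypothetical imputation: fill a missing hourly reading with the mean of the
# stations that did report at that hour (still NaN when every station is missing)
bj_aq_full[, PM25_filled := ifelse(
  is.na(PM25_Concentration),
  mean(PM25_Concentration, na.rm=TRUE),
  PM25_Concentration
), by=time]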
PM 2.5 Concentration for a Single Station (dygraphs for R)
dygraph(
  quality[station_id=="tiantan_aq", .(date, lower, avg, upper)],
  main="Tiantan Daily PM2.5 Concentration (Max clipped to 500)"
) %>%
  dyAxis("x", drawGrid=FALSE) %>% dyAxis("y", drawGrid=FALSE) %>%
  dySeries(c("lower", "avg", "upper"), label="PM2.5") %>%
  dyLimit(0, "good", labelLoc="right", color="grey") %>%
  dyLimit(12, "moderate", labelLoc="right", color="grey") %>%
  dyLimit(35, "unhealthy for sensitive groups",
          labelLoc="right", color="grey") %>%
  dyLimit(55, "unhealthy", labelLoc="right", color="grey") %>%
  dyLimit(150, "very unhealthy", labelLoc="right", color="grey") %>%
  dyLimit(250, "hazardous", labelLoc="right", color="grey") %>%
  dyShading(0, 12, color="#ccff99", axis="y") %>%
  dyShading(12, 35, color="#ffffcc", axis="y") %>%
  dyShading(35, 55, color="#ffebcc", axis="y") %>%
  dyShading(55, 150, color="#ffcccc", axis="y") %>%
  dyShading(150, 250, color="#e6ccff", axis="y") %>%
  dyShading(250, 500, color="#ffddcc", axis="y") %>%
  dyOptions(colors = RColorBrewer::brewer.pal(3, "Set1")) %>%
  dyRangeSelector(dateWindow=c(as.Date("2017-01-01"), max(quality$date))) %>%
  dyLegend(width = 200, show = "follow")
The dygraphs package for R is extremely easy to use and creates beautiful interactive charts. A huge chunk of the above code divides the chart into 6 regions according to the US EPA AQI standard. I was not able to follow the official color designations exactly because the background shading has to be light enough for the foreground to remain visible.
The center line is the mean concentration of the day, and the shading covers the 5th to 95th percentile of the day. The 95th percentile is clipped at 500 because there are some outliers (way above 500 ug/m³).
I probably should have explicitly stated the unit of concentration in the chart.
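The quality table fed into dygraph() is a daily per-station summary along those lines. A sketch of how it could be built (the column names date, lower, avg, and upper match the snippet above; the exact aggregation is an assumption based on the description):
# Daily mean plus 5th/95th percentiles per station, with the upper band clipped at 500
quality <- bj_aq_full[!is.na(PM25_Concentration), .(
  avg   = mean(PM25_Concentration),
  lower = quantile(PM25_Concentration, 0.05),
  upper = pmin(500, quantile(PM25_Concentration, 0.95))
), by=.(station_id, date=as.Date(time))]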
Station Geo Info via Folium (Python)
import pandas as pd
import folium

# Snapshot of all stations at a single hour, joined with their coordinates
data = pd.merge(
    bj_aq[bj_aq["time"] == "2017-01-01 20:00:00"],
    stations, on="station_id")

map_hooray = folium.Map(
    location=[39.929, 116.8],
    tiles="Stamen Terrain",
    zoom_start=9)

for _, row in data.iterrows():
    # colourgrad is a small helper (not shown here) that maps a value
    # in [0, 500] onto a color gradient
    color = colourgrad(
        0, 500, min(row["PM25_Concentration"], 500))
    folium.CircleMarker(
        [row["Latitude"], row["Longitude"]],
        color=color, radius=9, fill_opacity=1,
        fill=True, fill_color=color,
        popup=row["station_id"] + ":" + str(row["PM25_Concentration"])
    ).add_to(map_hooray)
This part is largely inspired by this repo (TheMoods/AirChina), which uses the Python package folium to visualize geographical information.
I’m sure we could combine several snapshots and make an animation in leaflet.js (folium is a Python wrapper of Leaflet), but I couldn’t find an easy way to do it in folium. This brings us to the next section.
Animation using gganimate
The background map is not essential in the animation (you only need to see it once), so why not just ditch it? Based on this idea, I used gganimate to visualize the dispersion of PM 2.5 particles:
(20180412 Update: the long version has been reinstated by YouTube after an appeal.)
# Note: this uses the old (pre-1.0) gganimate API, with frame as an aesthetic
p <- ggplot(
  bj_aq_full[time < ymd_hms("2017-01-10 00:00:00")],
  aes(x=Longitude, y=Latitude,
      fill=pmin(500, PM25_Concentration),
      frame=time, cumulative=FALSE)) +
  geom_point(size=8, pch=21, color="grey50") +
  scale_fill_gradient(low = "#FFFF99", high = "#990000",
                      na.value = "white", guide = "colourbar")
p <- p + theme_minimal() + labs(fill="PM25") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
animation <- gganimate(p, "animation_tmp.gif", interval = 0.5,
                       ani.width = 1280, ani.height = 720)
# gganimate(p, interval = 0.5, ani.width = 1280,
#           ani.height = 720, filename="pm25_0101_0110.mp4",
#           other.opts = "-pix_fmt yuv420p")
The mechanism of gganimate is really simple. It creates one PNG file per frame and combines them into a GIF file, or into an MP4 file using ffmpeg (the commented-out part).
We can see some regional patterns going on in the video. So taking information from nearby stations can be helpful when forecasting.
Be aware that the low frame rate might make some video players unable to play the resulting MP4 files properly. One solution is to tell ffmpeg to artificially increase the frame rate, but I found it too slow and the larger output files unacceptable. Just find a video player that supports lower frame rates (YouTube supports them, at least).
Other Resources
The above is a very good overview of how to forecast PM 2.5 concentration, along with an awesome visualization. It tells us that one of the most important factors is the wind, whose forecast is part of the bigger weather forecasting problem. So this competition might actually boil down to a weather forecasting problem.
In The Signal and the Noise, Nate Silver dedicated an entire chapter to weather forecasting, referring to it as one of the rare success stories of forecasting. Indeed, weather forecasting and visualization are widely available and recognized, and the forecasting of pollutants based on weather is also well-studied.
AFAIK, most of these forecasts use large-scale simulation (ensemble forecasting) techniques. It seems to me the main challenge of this KDD Cup is to match that technique with much less data (we don’t have pollutant readings for nearby areas) and much less computing power (no supercomputers here). We are allowed to use public external data, though, so you can grab forecasts of wind speed and direction and base your forecasts on them. But such a model will be hard to validate because we don’t have historical forecast data. All in all, it seems to be a very challenging problem to tackle.
Source Code
Please refer to the following GitHub repo for the code used in this post:
And here are some rendered pages: