The air mystery of Marylebone Road in London

Kexin Zhai
Spring 2019 — Information Expositions
4 min readFeb 25, 2019

A horror story of data cleaning

It was a mystery and moonless night, with dark clouds blocking all sources of light. Vague shadows and shimmering laptop of icy gloom could be seen. The wind was like a hungry falcon, constantly wandering in search of food, blowing leaves to stir and stir. Horrible black, inch by inch close, bit by bit swallowed up my line of sight. I tried to find a little voice of life, but only heard my intensified heartbeat, plop, plop…

I kept my eyes on the last source of light at the moment, trying to use our last hope, the internet, to gain access to “classified data” on London Air. Mysterious signs suggested that there might be hidden secrets in Marylebone district of the City of Westminster in London. Yes, Baker Street is in this area. There was a problem waiting for us to solve: Is there any correlation between time and value of species?

Exploring and Cleaning Dataset

I downloaded a dataset of air quality in Marylebone district from London Air website.

First, let’s read our London Air Quality Dataset on Jupyter Notebook. We can see there were missing data under the “Value” column. The raw dataset had 10956 rows and six columns.

Next, we used value_counts() to check the columns of raw data. There were six species of data of air quality measurement. They are CO — Carbon Monoxide, FINE — PM2.5, SO2 — Sulphur Dioxide, WSPD — Wind Speed, O3 — Ozone, TMP — Temperature.

Then we used isnull() to check which columns have missing data. It told us only “Value” column includes missing data.

We used dropna() to remove the rows that have an empty value. We had 6279 rows left after dropping.

Compared the value counts of “Species” column before dropping with the value counts after dropping, we noticed that the values of P.M 2.5 and temperature didn’t be recorded in this dataset.

I tried to plot our dataset but it showed that this dataset still needed to be cleaned.

Hard to understand the trend without converting the format of the time

For “ReadingDateTime” column, I removed unnecessary values, and I only kept the data of month and year. Then I split this column into two columns, “Month” and “Year”. Now, we were prepared for our analysis.

Go Through the EDA Checklist

I’ve already done the first two steps before. The cleaned dataset has 6279 rows and 9 columns. Here is how this cleaned dataset looks like.

The cleaned dataset
Check the “n”s

Findings

From 2014 to 2018, the average value of ozone increased every year while the average value of sulfur dioxide decreased every year.

Year vs. Value by Species

I zoomed in to explore this dataset by checking the value of ozone monthly and yearly. There was a common trend that air had higher ozone around May to July. The cause might be a thunderstorm in spring, which can generate ozone under the action of lightning.

--

--