[Image: A terrible view of Madrid covered by pollution]

Preparing Madrid air quality data for analysis

Some preprocessing steps for anyone interested in analysing air quality data in Madrid

Javier Canales
Published in Analytics Vidhya
4 min read · May 19, 2020


Spain is one of the countries worst hit by COVID-19. Since the outbreak of the crisis in March, the Spanish government has taken several measures to contain the pandemic, putting the country under an exceptional lockdown. Strong social distancing measures have been adopted, such as the suspension of classes at educational institutions, traffic restrictions and the halt of all non-essential economic activity. The economic paralysis and the restrictions on freedom of movement have led to a sharp reduction in transport, industrial activity and electricity generation, resulting in historic drops in air pollution concentrations.

I have recently published an article on the Spanish site Agenda Pública analysing air quality data in Madrid during the COVID-19 crisis. However, the aim of this article is simply to share the Python code I used to prepare the data for analysis, as I believe it can be reused, hopefully saving some time for researchers who may be interested in using Madrid air quality data in their work.

As usual, the complete project can be found in my GitHub repository.

Preparing data for analysis

Madrid air quality data is available at the City of Madrid’s Open Data Portal. Air pollution is monitored by 24 remote stations located across the city. Information about the stations and the measured gases and particles can be found here. For the purpose of the analysis, I have used daily data covering the period 2010–2020 (until April).

Fortunately, the data files are well-structured and have a common format. Once we have downloaded the CSV files, we can merge them into a dataframe using glob. The glob module finds all the pathnames matching a specified pattern. In our case, I have saved all the files in the same path and named them using the pattern “datosXXXX.csv”, where XXXX is the year in which the data was monitored. The code below reads all the files in the path matching the pattern, loads them with the pandas read_csv function and concatenates them.
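The merging step can be sketched as follows. To keep the snippet self-contained it first writes two tiny yearly files; in practice the “datosXXXX.csv” files downloaded from the portal would already be on disk (and may need sep or encoding arguments to read_csv, depending on how they are exported):

```python
import glob
import os
import tempfile

import pandas as pd

# For a self-contained demo, write two tiny yearly files; the real files
# downloaded from the Open Data Portal would already be in place.
path = tempfile.mkdtemp()
for year in (2019, 2020):
    pd.DataFrame({"ANO": [year], "MES": [1], "D01": [10.0]}).to_csv(
        os.path.join(path, f"datos{year}.csv"), index=False
    )

# Find every file matching the naming pattern...
files = sorted(glob.glob(os.path.join(path, "datos*.csv")))
# ...read each one and concatenate them into a single dataframe
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(df.shape)
```

Sorting the matched paths keeps the years in chronological order, since glob itself does not guarantee any ordering.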

Now that we have our dataframe, it’s time to have a look at its appearance.

In essence, the dataframe contains daily measurements of different pollutants (MAGNITUD) collected by the 24 monitoring stations (ESTACION) in the city. There is one column for the year (ANO), another column for the month (MES), and one column for each day of the month (D01-D31). Each day column has an associated column (V01-V31) indicating whether the measurement was valid or not (‘V’ indicates a valid measurement and ’N’ the opposite). The complete information about the variables can be found here.

Since our dataframe has data points that are not valid, it is better to set them to null values. In order to keep only valid measurements, I am going to use the function combine_first, which combines two DataFrame objects by filling null values in one dataframe with non-null values from the other.

To create the two dataframes, we can use regular expressions to identify column names that match a specific pattern (to make our life easier, we have converted the column names to lowercase). We want to identify the daily measurements and their validity flags:
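A sketch of this selection step, using DataFrame.filter with a regex on a toy dataframe (the real one has the full d01–d31 and v01–v31 ranges):

```python
import pandas as pd

# Toy dataframe mimicking the structure after lowercasing the column names
df = pd.DataFrame({
    "estacion": [4], "magnitud": [8], "ano": [2020], "mes": [3],
    "d01": [42.0], "d02": [37.0], "v01": ["V"], "v02": ["N"],
})

# Daily measurements: a 'd' followed by exactly two digits
measurements = df.filter(regex=r"^d\d{2}$")
# Validity flags: a 'v' followed by exactly two digits
validity = df.filter(regex=r"^v\d{2}$")
print(list(measurements.columns), list(validity.columns))
```

Anchoring the pattern with ^ and $ avoids accidentally matching other columns that merely contain a “d” or “v”.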


Now, we just have to set the ‘V’ values in the right dataframe to null and use the function combine_first to fill those null values with the measurements in the left dataframe. Once the new dataframe is created, the remaining ’N’ values can be set to null. Finally, we can add back the remaining columns.
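A minimal sketch of this step with toy data (column names and the ‘V’/’N’ coding follow the structure described above):

```python
import numpy as np
import pandas as pd

# Toy data: identifier columns plus paired measurement (d..) and
# validity (v..) columns for two station/pollutant rows
ids = pd.DataFrame({"estacion": [4, 8], "magnitud": [8, 8],
                    "ano": [2020, 2020], "mes": [3, 3]})
measurements = pd.DataFrame({"d01": [42.0, 15.0], "d02": [37.0, 20.0]})
validity = pd.DataFrame({"v01": ["V", "N"], "v02": ["N", "V"]})

# Rename the validity columns so they align with the measurement columns
validity.columns = measurements.columns
# 'V' -> null, so combine_first fills those positions with the measurements;
# invalid positions keep their 'N' flag for the moment
clean = validity.replace("V", np.nan).combine_first(measurements)
# Turn the remaining 'N' flags into proper nulls
clean = clean.replace("N", np.nan).astype(float)
# Add back the identifier columns
df_clean = pd.concat([ids, clean], axis=1)
print(df_clean)
```

Note that combine_first aligns on both index and column labels, which is why the validity columns must be renamed first.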

Next, we want to simplify the dataframe by replacing the day, month and year columns with a single date column. This will later allow us to exploit the power of the DateTime format.
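One way to do this is to melt the day columns into rows and then assemble the date, as sketched below with toy data. Since every month carries the full d01–d31 range, impossible dates such as 30 February appear and are coerced to nulls, then dropped:

```python
import pandas as pd

# Toy cleaned dataframe: one row per station/pollutant/month,
# one column per day
df_clean = pd.DataFrame({
    "estacion": [4], "magnitud": [8], "ano": [2020], "mes": [3],
    "d01": [42.0], "d02": [37.0],
})

# Turn the day columns into rows...
long = df_clean.melt(id_vars=["estacion", "magnitud", "ano", "mes"],
                     var_name="day", value_name="value")
# ...extract the day number from the column name ('d01' -> 1)...
long["day"] = long["day"].str[1:].astype(int)
# ...and assemble a single date column, coercing impossible dates to null
long["date"] = pd.to_datetime(
    long[["ano", "mes", "day"]].rename(columns={"ano": "year", "mes": "month"}),
    errors="coerce",
)
long = long.dropna(subset=["date"]).drop(columns=["ano", "mes", "day"])
print(long)
```

pd.to_datetime can assemble dates directly from year/month/day columns, which is why the Spanish column names are renamed on the fly.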

After this step, our dataframe looks much more appealing.

However, there is still some room for improvement. On many occasions, we may not be interested in getting air quality data per station, but in an aggregated fashion. Thus, we can use the function groupby to get the daily mean concentration per pollutant.
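The aggregation can be sketched as follows (toy data; two stations measuring the same pollutant on the same day are averaged into one row):

```python
import pandas as pd

# Toy long-format data: two stations measuring pollutant 8 on 1 March
long = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02"],
    "estacion": [4, 8, 4],
    "magnitud": [8, 8, 8],
    "value": [40.0, 20.0, 10.0],
})

# Average over the stations: one row per date and pollutant
daily = long.groupby(["date", "magnitud"], as_index=False)["value"].mean()
print(daily)
```

as_index=False keeps date and magnitud as ordinary columns, which is convenient for the pivot step that follows.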

Then, we can use the function pivot to produce new columns based on the values in ‘magnitud’. The new columns will represent the measurements of the different pollutants. The numeric code identifying each pollutant can be found here.
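A sketch of the pivot (the pollutant codes and values here are purely illustrative; the real codes are documented in the portal):

```python
import pandas as pd

# Toy aggregated data: two pollutant codes measured on two dates
daily = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"],
    "magnitud": [8, 14, 8, 14],  # illustrative pollutant codes
    "value": [30.0, 55.0, 10.0, 60.0],
})

# One column per pollutant code, one row per date
wide = daily.pivot(index="date", columns="magnitud", values="value")
print(wide)
```

pivot requires each (date, magnitud) pair to be unique, which the groupby aggregation in the previous step guarantees.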

Finally, we can convert our date index to DateTime format. The resulting dataframe is then ready for time-series analysis.
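The index conversion itself is a one-liner (toy pivoted dataframe whose index is still made of date strings):

```python
import pandas as pd

# A pivoted dataframe indexed by date strings
wide = pd.DataFrame({8: [30.0, 10.0]}, index=["2020-03-01", "2020-03-02"])

# Converting the index unlocks resampling, rolling windows, date slicing, etc.
wide.index = pd.to_datetime(wide.index)
print(wide.index.dtype)
```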

Hopefully, these simple preprocessing steps can be of help for anyone interested in analysing air quality data in Madrid.
