Visualizing Air Pollution data using seaborn library

Published in We Are Orb · 4 min read · Dec 19, 2017

In the previous post, we saw some basic visualization techniques using pandas and matplotlib. With the same prerequisites, let's now delve into some more advanced plotting techniques using the seaborn library.

Seaborn is a data visualization library for Python built on top of matplotlib, so we can still call matplotlib functions directly to adjust a plot. It works with the data structures provided by numpy and pandas, such as DataFrames, and it can read semantic information from pandas objects (column names, for example) to add informative labels to our plots. With seaborn we can create multi-plot figures, switch between themes and color schemes, and even fit a regression model to our data.
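To make the last two points concrete, here is a minimal sketch, assuming seaborn is already installed (see Step 1). The `toy` DataFrame and its values are made up for illustration; only the column names mimic the pollution data used later.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the real pollution data set
toy = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Virginia", "Virginia"],
    "NO2 AQI": [10, 14, 30, 26],
})

# seaborn picks up the column names and uses them as axis labels
ax = sns.barplot(x="NO2 AQI", y="State", data=toy)
print(ax.get_xlabel(), "|", ax.get_ylabel())

# The return value is a plain matplotlib Axes, so matplotlib calls still apply
ax.set_title("Mean NO2 AQI per state (toy data)")
plt.gcf().set_size_inches(6, 3)
```

This is why the snippets in this post freely mix `sns.` and `plt.` calls.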

Let the plots begin!

Step 1: The libraries

First we need to install seaborn to get started. If you already have Anaconda installed, the easiest way is to open your Anaconda Prompt and run:

pip install seaborn

Other installation methods are described here.

Once the installation is done, let's open up a Jupyter notebook and import the libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: The Data set

Let us analyze US Pollution data from 2000–2016 obtained from Kaggle. It contains statistical data about the levels of pollutants — SO2, NO2, CO and O3 — in the air in different states. Various visualizations can be created from this data.

Before that, the data set has some null values which we need to eliminate. Let's apply some of the techniques mentioned in this article on basic data cleansing:

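# `data` is assumed to already hold the Kaggle CSV, loaded with pd.read_csv()
# Drop every row that contains at least one empty cell
data.dropna(inplace=True)
# Renumber the remaining rows from 0 and discard the old index
data.reset_index(inplace=True, drop=True)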
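If it is not obvious why the index needs resetting, this small sketch shows what the two calls do. The `toy` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two incomplete rows
toy = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Virginia", "Virginia"],
    "SO2 AQI": [3.0, np.nan, 7.0, 5.0],
    "CO AQI": [1.0, 2.0, np.nan, 4.0],
})

# Rows 1 and 2 each contain a NaN, so they are dropped
toy.dropna(inplace=True)
print(toy.index.tolist())  # [0, 3] -> the old row labels survive, leaving gaps

# Renumber the rows so positional and label-based access agree again
toy.reset_index(inplace=True, drop=True)
print(toy.index.tolist())  # [0, 1]
```

Without `drop=True`, pandas would keep the old labels as an extra `index` column.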

Step 3: Plot the data

Seaborn offers five themes: darkgrid, whitegrid, dark, white, and ticks. You can see how these themes look here.
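A quick way to see what a theme changes, without drawing anything, is to inspect the settings it would apply. This is only a sketch: it looks at a single setting (`axes.grid`), and the variable names are made up.

```python
import seaborn as sns

themes = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# axes_style() returns the matplotlib settings a theme would apply,
# without touching the global configuration
grid_on = {name: sns.axes_style(name)["axes.grid"] for name in themes}
print(grid_on)

# Activate one theme for all subsequent plots
sns.set_style("whitegrid")
```

Only the two themes with "grid" in their name switch the background grid on.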

For starters, let's plot the NO2 Air Quality Index (AQI) for each state.

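sns.set(style="dark")
# One horizontal bar per state; its length is the mean NO2 AQI of that state.
# ci=100 sets the confidence level of the error bars; seaborn 0.12+ spells this errorbar=("ci", 100).
ax = sns.barplot(x="NO2 AQI", y="State", data=data, ci=100)
# Enlarge the figure so that every state label stays readable
fig = plt.gcf()
fig.set_size_inches(30, 13)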
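By default, barplot() aggregates all rows that share a category with the mean, so each bar length above is simply a per-state average. A minimal pandas sketch of that aggregation, on invented numbers:

```python
import pandas as pd

# Hypothetical toy data, not the real pollution data set
toy = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Virginia", "Virginia", "Virginia"],
    "NO2 AQI": [10, 14, 30, 26, 34],
})

# The same per-state mean that barplot() draws as the bar length
bar_lengths = toy.groupby("State")["NO2 AQI"].mean()
print(bar_lengths.to_dict())  # {'Alaska': 12.0, 'Virginia': 30.0}
```

Computing the numbers this way is handy when you want to check a chart or sort the states before plotting.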

Plotting multiple variables in the same plot

Comparing the data for 2 states using density plots

Now I want to compare the CO and SO2 distributions of just two states in the same plot. Is that possible? Of course it is, with a little pandas data manipulation and seaborn's density plots drawn in different colors.

Let's use a different background this time:

sns.set(style="white")

First we extract the data for the 2 desired states into separate dataframes:

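# Assumption: data1 (undefined in this article) is the cleaned data cut down to State plus the four AQI columns
data1 = data[["State", "NO2 AQI", "O3 AQI", "SO2 AQI", "CO AQI"]]
alaska = data1.query("State == 'Alaska'")
virginia = data1.query("State == 'Virginia'")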
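The query() calls above are a compact alternative to boolean indexing. A small sketch on made-up data shows that both give the same rows, and how to refer to a column whose name contains a space (backticks, supported by reasonably recent pandas versions):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "State": ["Alaska", "Virginia", "Alaska"],
    "CO AQI": [2, 5, 3],
})

by_query = toy.query("State == 'Alaska'")
by_mask = toy[toy["State"] == "Alaska"]
print(by_query.equals(by_mask))  # True

# Column names with spaces must be wrapped in backticks inside query()
high_co = toy.query("`CO AQI` > 2")
print(len(high_co))  # 2
```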

Then we draw both of them on the same axes:

# Set up the figure
f, ax = plt.subplots(figsize=(15, 25))
ax.set_aspect("equal")

# Draw the two density plots on the same axes.
# seaborn 0.11+ takes x= and y= as keywords and fill=True in place of shade=True;
# its default threshold leaves the lowest level unshaded, as shade_lowest=False did.
ax = sns.kdeplot(x=alaska["CO AQI"], y=alaska["SO2 AQI"],
                 cmap="Reds", fill=True)
ax = sns.kdeplot(x=virginia["CO AQI"], y=virginia["SO2 AQI"],
                 cmap="Blues", fill=True)

# Add labels to the plot
red = sns.color_palette("Reds")[-2]
blue = sns.color_palette("Blues")[-2]
ax.text(2.5, 8.2, "Virginia", size=16, color=blue)
ax.text(3.8, 4.5, "Alaska", size=16, color=red)

Visualizing relationship between multiple parameters

So far, we have seen how one gas (NO2) was distributed across all the states and we compared the distribution of 2 gases (SO2 and CO) across 2 states.

Suppose we want to observe how all four gases vary with respect to each other across the states. Instead of naming each pair of columns manually, we can do:

sns.set(style="darkgrid") # just another theme
sns.pairplot(data1, hue="State")

When we do this, seaborn plots the relationship between each pair of columns in the DataFrame, color-coded by state (hue="State" essentially means "categorize by state"). The result is a HUGE plot. Notice that the diagonal of the grid shows the univariate distribution of each variable.

Code Repository

You can find the code used in this article here.

Where to go from here

Seaborn also lets us fit simple regression models to our plots. regplot() and lmplot() are the two main functions for this purpose. But to properly understand them, you will need to know what linear regression is. This post on Linear regression in Python is a good starting place.
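As a small taste, here is a sketch of regplot() on synthetic data. The linear relationship between the two columns is invented purely for illustration. regplot() only draws the fit and does not hand back its coefficients, so the sketch uses numpy's polyfit to recover the slope and intercept separately.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)

# Hypothetical toy data with a built-in linear trend (slope 0.2) plus noise
toy = pd.DataFrame({"NO2 AQI": rng.uniform(0, 60, 100)})
toy["CO AQI"] = 0.2 * toy["NO2 AQI"] + rng.normal(0, 2, 100)

# Scatter plot with a fitted regression line and a confidence band
ax = sns.regplot(x="NO2 AQI", y="CO AQI", data=toy)

# Recover the coefficients of the straight-line fit with numpy
slope, intercept = np.polyfit(toy["NO2 AQI"], toy["CO AQI"], 1)
print(round(slope, 2), round(intercept, 2))
```

lmplot() draws the same kind of fit but returns a grid of subplots, which makes it the better choice when you want one regression per state.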

Useful Resources

The official doc is your go-to place. It has a detailed tutorial, an API reference and a gallery of many examples.

Places where you can get Open Public Datasets

Kaggle’s datasets

UCI Machine learning Repository

Awesome list of open datasets
