Data Visualization with Python and Seaborn — Part 1: Loading Datasets

Random Nerd
Aug 16, 2018

When working with Seaborn, we can either use one of the built-in datasets that Seaborn offers or load a Pandas DataFrame of our own. Seaborn is part of the PyData stack and hence accepts Pandas' data structures. Let us begin by importing a few built-in datasets, but before that we shall import a few other libraries that Seaborn depends upon:

# IMPORTING REQUIRED LIBRARIES & ASSIGNING ALIASES:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

The dataset we shall be dealing with in this illustration is the Iris Flower dataset. We shall similarly load other built-in datasets later on. Let us also take a sneak peek at how this Iris dataset looks, using Pandas to do so.

# Loading built-in Datasets:
iris = sns.load_dataset("iris")

The Iris dataset has 50 samples from each of three species of Iris flower (Setosa, Virginica and Versicolor). Four features were measured (in centimeters) for each sample: the length and width of the sepals and petals. Let us try to have a summarized view of this dataset:
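Assuming the iris DataFrame loaded above, a quick preview plus summary might look like this (the exact output layout may vary slightly with your Pandas version):

```python
import seaborn as sns

iris = sns.load_dataset("iris")

# Peek at the first five rows:
print(iris.head())

# Descriptive statistics for the four numeric features:
print(iris.describe())
```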

.describe() is a very useful method in Pandas as it generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. Without going in-depth into analysis here, let us try to plot something simple from this dataset:
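A minimal sketch of such a plot follows; the columns plotted here are my own illustrative choice, since the original figure's parameters aren't reproduced in the text:

```python
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

# A swarm plot with minimal parameters -- one categorical axis, one numeric:
ax = sns.swarmplot(x="species", y="petal_length", data=iris)
# In Jupyter the figure renders inline; in a script, call plt.show().
```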

This beautiful representation of data we see above is known as a Swarm Plot, drawn here with minimal parameters. I shall be covering it in detail later on, but for now I just wanted you to have a feel of the serenity we're getting into. Let us now try to load an external dataset; the one I've picked for this illustration is the PoliceKillingsUS dataset. This dataset has been prepared by The Washington Post (they keep it updated) and records every fatal shooting in the United States by a police officer in the line of duty since Jan. 1, 2015.

# Loading Pandas DataFrame:
df = pd.read_csv('~/Downloads/PoliceKillingsUS.csv', encoding='windows-1252')

Always take note of your dataset and choose the encoding accordingly, or else you might not be able to properly decode it into a Pandas DataFrame. A few common options include utf-8, utf-16, latin-1, iso-8859-1, iso-8859-15 and cp1252. Also ensure you mention the complete PATH to your dataset if it isn't in the same local directory as your IDE (Jupyter Notebook, for instance). If unaware of the PATH, run pwd in an input cell of Jupyter Notebook to fetch that information. For more detailed know-how on Jupyter Notebook, please refer to my other article focusing entirely on that subject. Moving on, just the way we looked into the Iris dataset, let us now have a preview of this dataset as well. We won't be getting into deep analysis of this dataset because our agenda is only to visualize the content within and gradually discover the statistical inferences. So, let's do this:
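The CSV itself isn't bundled with this article, so as a sketch, here is a tiny stand-in frame with placeholder rows mirroring a subset of the real columns, and the same preview call you would run on the loaded df:

```python
import pandas as pd

# Stand-in for:
#   df = pd.read_csv('~/Downloads/PoliceKillingsUS.csv', encoding='windows-1252')
# The rows below are placeholders covering only a subset of the real columns:
df = pd.DataFrame({
    "race": ["W", "B", "H"],
    "gender": ["M", "M", "F"],
    "threat_level": ["attack", "other", "undetermined"],
    "flee": ["Not fleeing", "Foot", "Car"],
    "body_camera": [False, True, False],
})

# Preview the first few rows, just as we did with the Iris dataset:
print(df.head())
```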

This dataset is pretty self-descriptive and has a limited number of features (read: columns).

race indicates:

  • W: White, non-Hispanic
  • B: Black, non-Hispanic
  • A: Asian
  • N: Native American
  • H: Hispanic
  • O: Other
  • None: unknown

And gender indicates: M: Male, F: Female, None: unknown. The threat_level variable includes incidents where officers or others were shot at, threatened with a gun, attacked with other weapons or physical force, etc. The attack category is meant to flag the highest level of threat. The other and undetermined categories represent all remaining cases; other includes many incidents where officers or others faced significant threats.

The threat column and the fleeing column are not necessarily related. Also, attack represents a status immediately before the fatal shots by police, while fleeing could begin slightly earlier and involve a chase. Lastly, body_camera indicates whether an officer was wearing a body camera that may have recorded some portion of the incident. Let us now look into the descriptive statistics to figure out the quartiles, counts and overall spread of the data:
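Sketching both steps on a small stand-in frame (with the real data this would simply be df.describe() on the frame loaded earlier; the x and y columns in the plot are hypothetical choices, as the original figure's parameters aren't shown):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in rows; with the real data this would be the df loaded earlier:
df = pd.DataFrame({
    "gender": ["M", "F", "M", "M", "F", "M", "F", "M"],
    "age": [34, 27, 45, 52, 31, 23, 38, 29],
})

# Descriptive statistics (count, mean, std, quartiles) for numeric columns:
print(df.describe())

# A strip plot: a scatter of a numeric variable across categories:
ax = sns.stripplot(x="gender", y="age", data=df)
```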

This plot is known as a Strip Plot and is pretty ideal for categorical values. This too shall be dealt with at length in the following articles. Next, let us look into controlling the aesthetics of our plots and a few other important aspects. One of the biggest advantages of Seaborn over Matplotlib is that its default aesthetics are visually far more appealing. Undoubtedly Matplotlib is highly customizable, but it can be difficult to know exactly which settings to tweak to achieve an attractive plot unless you know how to navigate the Matplotlib documentation. Seaborn, on the other hand, comes with a number of customized themes and a high-level interface for controlling the look of the underlying Matplotlib figures.

Seaborn splits Matplotlib parameters into two independent groups: the first group sets the aesthetic style of the plot, and the second scales various elements of the figure so it can easily be incorporated into different contexts. Seaborn doesn't take any credit away from Matplotlib, but rather adds some nice default aesthetics and built-in plots that complement, and sometimes replace, the complicated Matplotlib code professionals needed to write; Facet plots and Regression plots are examples of that. In this article I shall show how easy it is to build a Regression plot using Seaborn, and then we can compare it against building something similar in Matplotlib. Quick note: the term aesthetics just refers to the appearance of the figure/plot.

Quick note: sns.set() needs to be explicitly called only in Seaborn v0.8 and above; earlier versions applied the default theme on import.
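Here is a sketch of such a regression plot on synthetic data (the data is made up purely for illustration; the point is that Seaborn draws the scatter, the fitted line and a confidence band in a single call):

```python
import numpy as np
import seaborn as sns

sns.set()  # apply Seaborn's default theme (explicit call needed in v0.8+)

# Synthetic, roughly linear data for illustration:
rng = np.random.RandomState(42)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 2, size=50)

# One line: scatter points + regression line + confidence interval:
ax = sns.regplot(x=x, y=y)
```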

Let us now look into what else we can do to vary the appearance of our plots. Earlier we spoke about Seaborn segregating Matplotlib parameters into two independent groups: styling and scaling of figures. Let us begin by delving into the styling aspect, which is controlled using the sns.axes_style() and sns.set_style() functions.
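For instance, sns.set_style() applies a style globally; the sine curve below is just an illustrative stand-in for whatever figure follows:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")  # white background with a grid

x = np.linspace(0, 14, 100)
plt.plot(x, np.sin(x))
```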

A few common style options include: whitegrid (as shown above), dark (for a solid grey background), white (the default) and ticks. Let us explore an example with ticks as the style and then try to remove the top and right axes spines:
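A sketch of switching to the ticks style (again with a stand-in sine curve):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("ticks")  # white background, no grid, ticks on the axes

x = np.linspace(0, 14, 100)
plt.plot(x, np.sin(x))
```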

In the above plot we observe two changes:

  • There are no horizontal lines/grid in the background, as we had in the previous figure.
  • There are ticks on the X and Y axes, marking the axis intervals.

We still have the top and right axes spines, so let's get rid of those now:
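With the ticks style in place, a single call removes those spines (sine curve again as a stand-in figure):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("ticks")
x = np.linspace(0, 14, 100)
plt.plot(x, np.sin(x))

# By default, despine() removes the top and right spines:
sns.despine()
```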

If we also want to despine the left axis, we just need to alter our last line of code by adding a parameter, left=True, which shall fetch:

Now let us try to temporarily mix two styles in a single figure. For this, we shall use .axes_style() as a context manager, as follows:
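A sketch: .axes_style() used in a with block applies a style only to the plots created inside it (the curves plotted here are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x = np.linspace(0, 14, 100)
f = plt.figure()

# The first subplot picks up 'darkgrid' only inside the with block...
with sns.axes_style("darkgrid"):
    ax1 = f.add_subplot(1, 2, 1)
    ax1.plot(x, np.sin(x))

# ...while the second keeps whatever style is currently set globally:
ax2 = f.add_subplot(1, 2, 2)
ax2.plot(x, np.cos(x))
```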

Let us now look into the other independent group of Matplotlib parameters handled by Seaborn, i.e. the scaling of plot elements, which is controlled using the .plotting_context() and .set_context() functions. The four preset contexts, in order of relative size, are paper, notebook, talk and poster. The notebook context is the default, and was used in all the plots shown above. First, let's reset the default parameters by calling sns.set() and then play around with the other contexts for scaling, starting with paper:
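For example (the plotted curve is a stand-in; only the two Seaborn calls matter here):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()                 # reset styling and scaling to Seaborn defaults
sns.set_context("paper")  # the smallest of the four preset contexts

x = np.linspace(0, 14, 100)
plt.plot(x, np.sin(x))
```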

I am pretty sure you must be thinking that this figure/plot is in no way scaled, as it looks similar to our previous plot outputs. So let me clarify right away: Jupyter Notebook scales down large images in the notebook cell output, so past a certain size we get automatic figure scaling. For exploratory analysis, where we prefer iterating quickly over a number of different analyses, it's more useful to have facets of similar size than to have overall figures that are the same size in a particular context. When we're in a situation where we need something that's exactly a certain size overall, ideally we:

  • know precisely what we want and
  • can afford to take off some time and work through the calculations.

With all that being said, if we plot the same figure in an editor like Atom, Anaconda's Spyder, or JetBrains' PyCharm or IntelliJ, we shall be able to visualize it at its original size, as a new window is launched with the visualization output. Hence the take-away from this scaling segment is that the addition of a single line of code can fetch an image of the size we require, and we may experiment accordingly. In practice, we can also pass a dictionary of parameters using rc to have finer control over the aesthetics. Let me show you an example with the same sinplot function we defined earlier:
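The sinplot helper itself isn't reproduced in the text above, so below is a close equivalent of the offset-sine-waves helper that Seaborn's tutorial uses, together with an rc override; the choice of the poster context and the font_scale value are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def sinplot(n=10, flip=1):
    # n offset sine waves of decreasing amplitude:
    x = np.linspace(0, 14, 100)
    for i in range(1, n + 1):
        plt.plot(x, np.sin(x + i * .5) * (n + 2 - i) * flip)

# font_scale thickens the axis fonts; the rc dictionary gives finer
# control over individual parameters such as line width:
sns.set_context("poster", font_scale=1.5, rc={"lines.linewidth": 5})
sinplot()
```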

Though our Notebook didn't display an enlarged (scaled) plot, we may notice how in the backend (in memory) it has created the figure as per our instructions. We have thick lines in our plot now because I set linewidth to 5, and the fonts on the axes have grown because of font_scale. Generally we don't use anything more than that during data analysis, although exceptional scenarios may demand a few more parameters, which we shall gradually take care of in the next article of this series; you can access it using the navigation options listed below. In case of any queries, please feel free to leave a comment, and if you liked this article, leaving a few claps would keep me encouraged to add more quality content for you to learn from. Appreciate your time and patience!


Data Visualization with Python and Seaborn — Part 0

Data Visualization with Python and Seaborn — Part 2
