Data Visualization for Beginners

Data visualization is the practice of representing data visually, usually as plots. It makes data easier to understand and makes patterns, trends, and outliers easier to spot, which helps us identify areas that may need attention or improvement. In this article I’m going to show you how to create plots for Exploratory Data Analysis (EDA) and how to customize them. I will then show you how to wrap a plot in a function so it can be reused by calling it with different arguments.

I recommend reading the article below about matplotlib’s object-oriented interface and its Artist objects. I have based my article on this approach:

https://dev.to/skotaro/artist-in-matplotlib---something-i-wanted-to-know-before-spending-tremendous-hours-on-googling-how-tos--31oo

We’ll be working with both matplotlib and seaborn so we need to import these libraries:

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

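If you’re wondering what %matplotlib inline does, here’s the answer: it is not an import but an IPython magic command that sets the backend of matplotlib to the ‘inline’ backend. With this backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it. It only works inside IPython or Jupyter; in a plain Python script, leave it out and call plt.show() instead.

There are two plotting styles that I have been using: the pyplot functions and the object-oriented API.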

Pyplot Functions

As a total beginner to programming and Data Science I have been using the pyplot functions, in which every call is prefixed with plt. This is a good way to create quick plots, but it has limitations, and it gets confusing when you work with subplots.

Below is a quick way to create a plot if you want something really basic:

plt.scatter(df_categories['grade'], df_categories['price']);
plt.title('Grade v. Price');

Scatterplot using Pyplot

When we create subplots with the pyplot functions, the problem shows up as soon as we need to customize them: every plt. call applies to whichever subplot is currently active, so it is not obvious which plot a given line refers to. There are ways to deal with this, but they get confusing and there are more efficient approaches.

names = ['group_a', 'group_b', 'group_c']  # example category labels
values = [1, 10, 100]
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.bar(names, values)
plt.subplot(132)
plt.scatter(names, values)
plt.subplot(133)
plt.plot(names, values)
plt.suptitle('Categorical Plotting')
plt.show()

Object-Oriented API

I am learning the object-oriented API because I find its methods and attributes much easier to use for creating, editing and customizing plots. I will be working with histograms and scatterplots; however, any of the other plot types can be used.

What is a histogram? Wikipedia describes a histogram as an accurate representation of the distribution of numerical data. The horizontal axis shows the values of the variable, divided into bins, where every bin has a minimum and a maximum value. The vertical axis shows the frequency, such as a count or a density, of the observations that fall into each bin.
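To make the idea of bins concrete, here is a minimal sketch in plain Python that does by hand roughly what a plotting library does before it draws the bars. The function name bin_counts and the sample values are made up for illustration, and the sketch assumes equal-width bins and at least two distinct values.

```python
def bin_counts(values, n_bins):
    """Split the range of `values` into equal-width bins and count each bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # n_bins bins need n_bins + 1 edges: each bin runs from edges[i] to edges[i + 1]
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # the last bin also includes its upper edge, so clamp the index
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return edges, counts


edges, counts = bin_counts([1, 2, 2, 3, 3, 3, 4], n_bins=3)
print(edges)   # [1.0, 2.0, 3.0, 4.0]
print(counts)  # [1, 2, 4]
```

The counts are the bar heights, and each pair of neighbouring edges is the minimum and maximum of one bin.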

A scatterplot plots data points on a horizontal and a vertical axis to show how one variable changes with another. It shows us the relationship between two variables.

Figure and Axes Objects:

We are still working with the Matplotlib library, but calling methods on the axes object instead of the plt. functions.
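As a rough illustration, the three-panel categorical figure from the pyplot section could be written against axes objects like this. The category labels in names are hypothetical, and the Agg backend line is only there so the sketch runs as a plain script; drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt

names = ["group_a", "group_b", "group_c"]  # hypothetical category labels
values = [1, 10, 100]

# one figure, three axes side by side
fig, axes = plt.subplots(ncols=3, figsize=(9, 3))

axes[0].bar(names, values)      # each call names the plot it draws on
axes[1].scatter(names, values)
axes[2].plot(names, values)

axes[1].set_title("Scatter")    # customizes only the middle plot
fig.suptitle("Categorical Plotting")
```

Because every call starts with the axes it belongs to, there is no question about which plot a line customizes.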

A matplotlib figure consists of 3 objects: 1) The figure itself, which is the container for the plot. It’s like the picture frame.

2) The axes object. The axes are the actual plot(s)/image(s) inside the figure/frame.

The axes are the subplot. You can create a single plot or multiple plots so don’t let the word subplot lead you into thinking that this only works for multiple plots. All elements of the plot are contained in the axes object, which is inside the figure object. The axes contain information such as title, labels, grid, etc.

3) The axis objects. Each axes contains an xaxis and a yaxis (accessed as ax.xaxis and ax.yaxis), which hold the ticks and the tick labels.
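The three levels can be inspected directly. This is a small sketch with made-up data points; the Agg backend line is an assumption so that it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt

fig, ax = plt.subplots()          # one figure holding one axes
ax.plot([1, 2, 3], [2, 4, 8])     # made-up data

# 1) the figure is the container
print(type(fig).__name__)         # Figure
# 2) the axes lives inside the figure
print(ax.figure is fig)           # True
print(fig.axes == [ax])           # True
# 3) each axes owns an x axis and a y axis
print(type(ax.xaxis).__name__, type(ax.yaxis).__name__)  # XAxis YAxis

# labels set on the axis object are what the axes-level getter returns
ax.xaxis.set_label_text("x values")
print(ax.get_xlabel())            # x values
```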

Now that we have some background information, let’s make some plots.

The axes is an object, so we’ll customize it with its .set_ methods, such as .set_title().

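col = 'bedrooms'
fig, axes = plt.subplots(ncols=1, figsize=(7, 5))
# distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is its replacement for histograms
sns.distplot(df[col], ax=axes);
axes.set_title('Distribution of {}'.format(col.title()));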

What’s great is that we can use variables for our feature names, which becomes important when we turn the plot into a function.

Seaborn works with pandas objects, so when we pass it a column it automatically uses the column name as the axis label.

Next, we’ll see that we can start to add more plots to the axes. We are able to combine matplotlib and seaborn plots. We’ll create a histogram and a scatterplot.

Because we’re working with 2 plots we need to change ncols to ncols=2. plt.subplots then returns an array of axes, which we index as axes[0] and axes[1].

col = 'bedrooms'
fig, axes = plt.subplots(ncols=2, figsize=(18, 6))

# seaborn distplot: histogram
sns.distplot(df[col], ax=axes[0])  # tell it we want the plot on axes[0]

# this is being called from pandas
df.plot(kind='scatter', x=col, y='price', ax=axes[1])  # on axes[1]

# matplotlib function automatically adjusts subplots for you, so axes won't encroach on each other
plt.tight_layout()

Next, we’ll see how we can customize our titles, labels and applicable fonts.

col = 'bedrooms'
fig, axes = plt.subplots(ncols=2, figsize=(10, 6))

label_fonts = {'weight': 'bold',
               'family': 'serif',
               'size': 12}

title_fonts = {'weight': 'bold',
               'family': 'serif',
               'size': 20}

ax = axes[0]  # first plot, on the left
sns.distplot(df[col], ax=ax)  # tell it we want the plot on axes[0]

ax.set_title(f'Distribution of {col}', fontdict=title_fonts)
ax.set_ylabel('Density', fontdict=label_fonts)
ax.set_xlabel(ax.get_xlabel(), fontdict=label_fonts)

ax = axes[1]  # second plot, on the right
df.plot(kind='scatter', x=col, y='price', ax=ax)  # on axes[1]
ax.set_title(f'{col.title()} vs. Price', fontdict=title_fonts)
ax.set_ylabel(ax.get_ylabel(), fontdict=label_fonts)
ax.set_xlabel(ax.get_xlabel(), fontdict=label_fonts)

plt.tight_layout()

A few things to point out:

Since we’re using the object-oriented methods, I set the title and axis labels with the ax.set_ methods, e.g. .set_title, .set_xlabel.

In order to restyle the labels without retyping them, I fetched the existing label text with .get_xlabel() and .get_ylabel() and passed it back to the setter. We’re then able to tailor our title and labels with the fontdict parameter, which takes a dictionary of font settings.
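The get-then-set pattern can be checked on its own. In this sketch the data points are made up and the first set_xlabel call stands in for the label that seaborn or pandas would normally have set; the Agg backend line is an assumption so that it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt

label_fonts = {"weight": "bold", "family": "serif", "size": 12}

fig, ax = plt.subplots()
ax.scatter([2, 3, 4], [200, 310, 395])  # made-up data
ax.set_xlabel("bedrooms")  # stand-in for a label set by seaborn or pandas

# read the current text back and re-apply it with the new font settings
ax.set_xlabel(ax.get_xlabel(), fontdict=label_fonts)

label = ax.xaxis.label  # the Text object behind the x label
print(label.get_text(), label.get_fontweight(), label.get_fontsize())
```

The text stays the same while the weight, family and size now come from the dictionary.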

Create a Function to Create Plots

A function can be easily created and reused when needed. I want to point out the following:

  • all code below the def line (the function body) needs to be indented.
  • function arguments are the dataframe, the x feature (col), the y feature (target), and the figure size

def eda_plot(df, col='bedrooms', target='price', figsize=(10, 6)):
    fig, axes = plt.subplots(ncols=2, figsize=figsize)

    label_fonts = {'weight': 'bold',
                   'family': 'serif',
                   'size': 12}

    title_fonts = {'weight': 'bold',
                   'family': 'serif',
                   'size': 20}

    ax = axes[0]  # first plot, on the left
    sns.distplot(df[col], ax=ax)  # tell it we want the plot on axes[0]
    ax.set_title(f'Distribution of {col}', fontdict=title_fonts)
    ax.set_ylabel('Density', fontdict=label_fonts)
    ax.set_xlabel(ax.get_xlabel(), fontdict=label_fonts)

    ax = axes[1]  # second plot, on the right
    df.plot(kind='scatter', x=col, y=target, ax=ax)  # on axes[1]
    ax.set_title(f'{col.title()} vs. {target}', fontdict=title_fonts)
    ax.set_ylabel(ax.get_ylabel(), fontdict=label_fonts)
    ax.set_xlabel(ax.get_xlabel(), fontdict=label_fonts)

    plt.tight_layout()

eda_plot(df)  # call the function
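The payoff of a function is that one loop can cover several features. The sketch below is not the function above but a simplified stand-in, eda_plot_basic, that uses plain matplotlib calls (ax.hist and ax.scatter) so it runs without seaborn or pandas; the name, the made-up housing numbers and the Agg backend line are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt


def eda_plot_basic(data, col, target="price", figsize=(10, 6)):
    """Simplified stand-in for eda_plot: histogram of col, then col vs. target."""
    fig, axes = plt.subplots(ncols=2, figsize=figsize)

    axes[0].hist(data[col])
    axes[0].set_title(f"Distribution of {col}")
    axes[0].set_xlabel(col)
    axes[0].set_ylabel("Count")

    axes[1].scatter(data[col], data[target])
    axes[1].set_title(f"{col.title()} vs. {target.title()}")
    axes[1].set_xlabel(col)
    axes[1].set_ylabel(target)

    fig.tight_layout()
    return fig, axes  # returning them allows further tweaks by the caller


# made-up data; anything indexable by column name works, including a DataFrame
data = {
    "bedrooms": [2, 3, 3, 4, 5],
    "grade": [6, 7, 7, 8, 9],
    "price": [210, 320, 305, 410, 560],
}

figures = {}
for feature in ["bedrooms", "grade"]:
    fig, axes = eda_plot_basic(data, feature)
    figures[feature] = fig
```

Returning fig and axes is a design choice of this sketch: it lets the caller adjust or save a single figure after the function has done the common work.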

I hope this blog entry gave you some insight into using the object-oriented API for the data visualization part of Exploratory Data Analysis.
