Quick guide to Visualization in Python
Basics to create useful visuals in python using 'matplotlib' and 'seaborn'
Visualizing data is the key to exploratory analysis. It is not just for aesthetic purposes, but is essential to uncover insights on data distributions and feature interactions.
In this article, you will be introduced to the basics of creating some useful and common data visualizations using the ‘matplotlib’ and ‘seaborn’ modules in Python. The built-in ‘iris’ dataset from the sklearn module is used for the demonstration. Only the main arguments of each plot are showcased here, enough to let you create simple plots that serve the purpose without much pomp and glamour.
Basic Chart Elements
The basic chart elements such as chart title, axes labels, figure size and axes limits are common to all plots. Let’s first see how to set these and use that as the template for any plot you wish to create using ‘matplotlib’ and ‘seaborn’.
figure, figsize : Initiate the plot area and figure size
title : Set the plot title
xlabel, ylabel : Set the x and y axes labels
xlim, ylim : Set the x and y axes limits (optional). If omitted, the limits are set automatically based on the data
show : Display the plot
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(5,5))
## Line to create the desired plot ##
plt.xlabel("X Axis Label")
plt.ylabel("Y Axis Label")
plt.xlim(0,1)
plt.ylim(0,1)
plt.title("Chart title")
plt.show()
That’s it! Now, you are ready to fill in the blank chart area with any visualization of your choice. These basic elements remain the same irrespective of the plot you create. Let’s now see some common plots that come handy during data exploration.
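If you find yourself repeating this template, it can be wrapped in a small helper. The sketch below is one possible way to do that with matplotlib's object-oriented interface (plt.subplots and the Axes setter methods). The name new_chart and its arguments are made up for this illustration and are not part of matplotlib.

```python
import matplotlib.pyplot as plt


def new_chart(title, xlabel, ylabel, figsize=(5, 5), xlim=None, ylim=None):
    """Create a blank, labelled chart and return (fig, ax).

    `new_chart` is a hypothetical helper name, not a matplotlib function.
    """
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    # Only override the automatic limits when the caller asks for it
    if xlim is not None:
        ax.set_xlim(*xlim)
    if ylim is not None:
        ax.set_ylim(*ylim)
    return fig, ax


fig, ax = new_chart("Chart title", "X Axis Label", "Y Axis Label",
                    xlim=(0, 1), ylim=(0, 1))
## Line to create the desired plot, e.g. ax.plot(...) ##
plt.show()
```

Working with the returned ax object does the same job as the plt.xlabel / plt.title calls above. It mainly helps once a figure holds more than one chart.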
Loading the Iris data
First up, we need some data to play around with. Here’s how you load the commonly used ‘iris’ data from sklearn. To learn more about the different built-in datasets available in python and how to access them, check out this article.
from sklearn import datasets
import pandas as pd

# Load the dataset once and reuse the result
iris_raw = datasets.load_iris()
iris = pd.DataFrame(iris_raw.data, columns=iris_raw.feature_names)
# Map the numeric target codes (0, 1, 2) to the species names
iris['species'] = [iris_raw.target_names[i] for i in iris_raw.target]
iris.head()
Visualizations
Below are the plots that are demonstrated in this article.
- Histogram
- Scatter plot
- Pair plot
- Pie chart
- Count plot
- Bar plot
- Box plot
- Line plot
- Heat maps
Histogram
A histogram is the go-to plot for viewing how numerical data is distributed. It is probably the simplest plot you will come across, yet one of the most useful for getting a first look at the spread of values.
hist(x,bins) : Function to plot histogram where x is a single column of pandas dataframe and bins define the number of buckets in which the values will be segregated
plt.figure(figsize=(5,5))
plt.hist(x = iris['sepal length (cm)'], bins = 10)
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Frequency")
plt.title("Histogram of Sepal Length")
plt.show()
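The choice of bins changes how the same data looks, so it is worth trying more than one value. The sketch below draws the same column with 5 and with 20 bins side by side and keeps the counts that hist returns. The bin values are arbitrary picks for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

# Rebuild the iris frame so the example runs on its own
raw = datasets.load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
results = {}
for ax, n_bins in zip(axes, (5, 20)):
    # hist returns the count per bin, the bin edges and the drawn bars
    counts, edges, _ = ax.hist(iris['sepal length (cm)'], bins=n_bins)
    results[n_bins] = (counts, edges)
    ax.set_xlabel("Sepal Length (cm)")
    ax.set_ylabel("Frequency")
    ax.set_title(f"{n_bins} bins")
plt.show()
```

Whatever the number of bins, the counts always add up to the number of rows. Only the level of detail changes.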
Scatter Plot
Scatter plot shows the relationship between two numerical variables. This is a crucial plot in setting up regression-like analyses.
scatter(x,y) : Function to plot x vs y variables
plt.figure(figsize=(5,5))
plt.scatter(x=iris['petal length (cm)'], y=iris['petal width (cm)'])
plt.xlabel("Petal Length (cm)")
plt.ylabel("Petal Width (cm)")
plt.title("Petal Length vs Petal Width")
plt.show()
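The plain scatter call draws every point in one color. If you want to see the species separately in matplotlib, one option is to call scatter once per group, as in this sketch. Looping over groupby is just one way to do it, and the names used here are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

# Rebuild the iris frame so the example runs on its own
raw = datasets.load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)
iris['species'] = raw.target_names[raw.target]

fig, ax = plt.subplots(figsize=(5, 5))
# One scatter call per species: matplotlib picks a new color each time
for name, group in iris.groupby('species'):
    ax.scatter(x=group['petal length (cm)'], y=group['petal width (cm)'], label=name)
ax.set_xlabel("Petal Length (cm)")
ax.set_ylabel("Petal Width (cm)")
ax.set_title("Petal Length vs Petal Width by Species")
ax.legend(title="Species")
plt.show()
```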
Pair Plot
Now, let us use ‘seaborn’ to create plots among pairs of numerical variables. The diagonal shows the distribution of each individual variable (histograms by default, density curves when ‘hue’ is set) and the remaining panels show scatter plots for the pairs. If you have any categorical variable of interest, it can be used as the ‘hue’ argument in the pairplot function to see the points in the plot separately colored for each category.
pairplot(data, hue) : Function to plot the pairs of numerical variables in ‘data’ and colored by the variable mentioned in ‘hue’
Unlike matplotlib, a single line of code is sufficient to get the plot in the required format with all basic elements.
sns.pairplot(data=iris, hue='species')
Pie Chart
For categorical data, pie chart can be used to see the proportion of each category in a field. This is not a great choice of visualization if you have too many categories as the readability of the plot greatly reduces.
pie(x,labels,autopct) : Function to plot pie chart with ‘x’ as the counts or values, ‘labels’ denoting the categories, ‘autopct’ to define the format of the values to be displayed on each slice
plt.figure(figsize=(5,5))
plt.pie(x=iris['species'].value_counts(),labels=iris['species'].value_counts().index, autopct='%0.2f%%')
plt.title("Iris Species Proportion")
plt.show()
Count Plot
The ‘countplot’ function in ‘seaborn’ is a better choice than a pie chart for comparing the size of each class in a categorical field. Bars are easier to compare than pie slices.
countplot(x) : Function to plot the count of data points in each class of a categorical field
sns.countplot(x=iris['species'],color='green')
Bar Plot
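Bar plots are used when you need to plot a numerical variable across a categorical variable. By default, the ‘barplot’ function in ‘seaborn’ shows the mean of the numerical variable for each category, with an error bar on top.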
barplot(x,y) : Function in ‘seaborn’ to plot bars with x (usually a categorical field) and y (numerical field)
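A similar chart can be drawn using the ‘bar’ function in matplotlib with similar arguments: x for the categories and height for the numeric values. Note that ‘bar’ does not aggregate, so compute one value per category first.
# means = iris.groupby('species')['petal length (cm)'].mean()
# plt.bar(x=means.index, height=means.values)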
sns.barplot(x=iris['species'],y=iris['petal length (cm)'])
The same plot can be converted to a horizontal bar plot if you just flip the x and y inputs in the barplot function. Similarly, if you are using matplotlib, the ‘barh’ function can be used with arguments y and width, again on values aggregated per category.
# means = iris.groupby('species')['petal length (cm)'].mean()
# plt.barh(y=means.index, width=means.values)
sns.barplot(y=iris['species'],x=iris['petal length (cm)'])
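To make the matplotlib route concrete: ‘bar’ draws exactly the heights it is given, so the per-species mean has to be computed beforehand. The sketch below does this with a pandas groupby. The variable names are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

# Rebuild the iris frame so the example runs on its own
raw = datasets.load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)
iris['species'] = raw.target_names[raw.target]

# One mean petal length per species: this is the value each bar will show
mean_petal = iris.groupby('species')['petal length (cm)'].mean()

fig, ax = plt.subplots(figsize=(5, 5))
bars = ax.bar(x=mean_petal.index, height=mean_petal.values)
ax.set_xlabel("Species")
ax.set_ylabel("Mean Petal Length (cm)")
ax.set_title("Mean Petal Length by Species")
plt.show()
```

Passing the raw columns to ‘bar’ instead would draw one bar per row, stacked on top of each other at three x positions, which is rarely what you want.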
Box Plot
Box plot is useful to get a glance of the data distribution for numerical variables. The ‘box’ in box plot indicates the IQR (Inter Quartile Range) spanning from 25th to 75th percentile of the data with an enclosed line to mark the 50th percentile or the median. If a categorical variable is also provided in the arguments, the box plots will be created separately for the classes.
boxplot(x,y) : Function to create boxplot. If only ‘x’ is given, horizontal box plot is created; If only ‘y’ is given, vertical box plot is created; When both x (usually categorical) and y (numerical) are given, box plots for y for each level in x are created
# Box plot for a numerical variable #
sns.boxplot(y=iris['sepal length (cm)'])
plt.show()

# Box plot for a numerical variable by the levels of a categorical field #
sns.boxplot(x=iris['species'],y=iris['sepal length (cm)'])
plt.show()
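To connect the picture to the numbers, the values marked by the box can be computed directly with pandas. The sketch below also applies the 1.5 × IQR rule, which is the default whisker setting in both matplotlib and seaborn. Points beyond those fences are the ones drawn as individual outliers.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the iris frame so the example runs on its own
raw = datasets.load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)

sepal = iris['sepal length (cm)']
# 25th, 50th and 75th percentiles: bottom, middle line and top of the box
q1, median, q3 = sepal.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Default whisker rule: whiskers reach the last point within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sepal[(sepal < lower_fence) | (sepal > upper_fence)]

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("Number of points beyond the fences:", len(outliers))
```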
Line Plot
Line plots can be used to study the trend and relationship between two numerical variables. Multiple line plots sharing the same x-axis can be plotted in a single graph to study different relations and trends in numerical data.
plot(x,y,label) : Function to create a line plot with ‘x’ on x axis and ‘y’ on y axis, ‘label’ indicating the name of the series being plotted
x = list(range(10))
y = [i**2 for i in x]
z = [i**3 for i in x]

plt.figure(figsize=(10,5))
plt.plot(x,y,label='X^2')
plt.plot(x,z,label='X^3')
plt.xlabel("X")
plt.ylabel("X^2, X^3")
plt.title("Square and Cube of X")
plt.legend()
plt.show()
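When the series differ a lot in scale, as the square and cube do here, the smaller one gets flattened on a common y-axis. One alternative, sketched below, is to stack two panels that share the x-axis using plt.subplots with sharex=True. The layout and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt

x = list(range(10))
y = [i**2 for i in x]
z = [i**3 for i in x]

# Two stacked panels; sharex=True keeps their x-axes aligned
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
ax_top.plot(x, y, label='X^2')
ax_top.set_ylabel("X^2")
ax_top.legend()
ax_bottom.plot(x, z, label='X^3', color='tab:orange')
ax_bottom.set_ylabel("X^3")
ax_bottom.set_xlabel("X")
ax_bottom.legend()
fig.suptitle("Square and Cube of X")
plt.show()
```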
Heat-map
The ‘heatmap’ function in ‘seaborn’ is very useful for visualizing missing data and correlation matrices.
heatmap(data, cmap, cbar, annot) : Function to plot heat-map using ‘data’ with color gradient map ‘cmap’, ‘cbar’ (boolean) to indicate whether or not to use a color bar, ‘annot’ (boolean) to display data values in each cell
Let’s try visualizing missing data without any color bar. Since the ‘iris’ data does not have any missing values, I have created a copy named ‘iris_with_nulls’ where I have randomly removed some data points. This is to show what the heat-map looks like when missing values are present.
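The code that builds ‘iris_with_nulls’ is not shown in this article. One possible way to create such a copy is sketched below. The 10% missing rate, the seed and the choice to blank out only the numeric columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

# Rebuild the iris frame so the example runs on its own
raw = datasets.load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)
iris['species'] = raw.target_names[raw.target]

# Assumption: blank out roughly 10% of the numeric cells at random
rng = np.random.default_rng(seed=0)
numeric_cols = iris.columns.drop('species')
null_mask = rng.random(iris[numeric_cols].shape) < 0.1

iris_with_nulls = iris.copy()
# DataFrame.mask replaces the cells where the condition is True with NaN
iris_with_nulls[numeric_cols] = iris_with_nulls[numeric_cols].mask(null_mask)

print(iris_with_nulls.isnull().sum())
```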
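# Data with missing values: the nulls show up as contrasting cells #
sns.heatmap(data=iris_with_nulls.isnull(),cbar=False)
plt.show()

# Original data without missing values: a single uniform color #
sns.heatmap(data=iris.isnull(),cbar=False)
plt.show()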
Next up, let us visualize the correlation matrix using the heat-map. The use of color bars and annotation renders the correlation matrix a powerful visual to analyze the relationships in numerical data.
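# Drop the text column 'species' first: corr() needs numerical data #
iris_corr = iris.drop(columns='species').corr()

sns.heatmap(data=iris_corr,annot=True)
plt.show()

sns.heatmap(data=iris_corr,annot=True,cmap='RdYlGn')
plt.show()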
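Correlations always lie between -1 and 1, so it can help to pin the color scale to that range with the vmin and vmax arguments of heatmap. That way the same color means the same correlation across different datasets. A minimal sketch follows. The two-decimal format is just a readability choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets

# Rebuild the iris frame so the example runs on its own
raw = datasets.load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)
iris['species'] = raw.target_names[raw.target]

# Correlation is only defined for the numerical columns
corr = iris.drop(columns='species').corr()

plt.figure(figsize=(6, 5))
# vmin/vmax fix the color scale to the full correlation range
ax = sns.heatmap(data=corr, annot=True, fmt='.2f', cmap='RdYlGn', vmin=-1, vmax=1)
ax.set_title("Correlation Matrix of Iris Features")
plt.show()
```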
Now, you have all the necessary items ready in your visualization tool-kit. Go ahead and plot away your data!