https://www.pexels.com/photo/overhead-shot-of-a-paper-with-graphs-and-charts-7947663/
Photo by RODNAE Productions

DATA VISUALIZATION 101

Seaborn charts that every Data Scientist Knows!

Master data visualization with these Seaborn charts

Prathamesh Gadekar
7 min read · Apr 3, 2023


Introduction:

Data isn’t just numbers, but a narrative waiting to be uncovered.

The world of data science has witnessed an astonishing surge, and I’ve had a front-row seat to this remarkable journey. 📈💡

Seaborn is a Python library for drawing statistical plots that help us extract useful insights from data 🐍📊.

In this blog, we won’t just scratch the surface; we’ll dive deep into Seaborn’s capabilities.

Without further ado, let’s get started.

We are going to use the Iris dataset for visualization purposes. You can find the dataset here.

We begin by importing the libraries and loading the data.

If you’re not using Kaggle, replace the path below with the local path of the Iris.csv file that you downloaded to your computer.

# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Load the data into a pandas DataFrame
data = pd.read_csv('/kaggle/input/iris/Iris.csv')
data.head()  # Show the first 5 rows of the dataset
First 5 values of the Dataset. (Source: Author)

SepalLengthCm and SepalWidthCm are the length and width of the sepals of an Iris flower, measured in centimeters. Similarly, PetalLengthCm and PetalWidthCm are the length and width of the petals, also measured in centimeters. Species indicates the species of the flower. There are three Iris species in the dataset: Iris setosa, Iris virginica, and Iris versicolor.

Lineplot

In this plot, the relationship between two variables (x and y) is shown by plotting the points on a 2D graph and connecting them with a line. It is most often used to show how a variable changes over time.

There are several uses for line plots, including studying dynamic variables (that change over time), finding patterns, and tracking trends. We can generate a line plot using the lineplot function.

# Create a line plot of Sepal Length vs. Sepal Width for all species
sns.lineplot(data=data, x='SepalLengthCm', y='SepalWidthCm')
Line Plot (Source: Author)
# We can also draw a separate line for each species
sns.lineplot(data=data, x='SepalLengthCm', y='SepalWidthCm', hue='Species')
Separate Line Plots for all the categories (Source: Author)
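The Iris data contains several flowers with the same sepal length, so many x values have more than one y value. By default, lineplot handles this by drawing the mean of y at each x, with a shaded confidence band around it. Here is a minimal sketch of that aggregation with pandas, using a few made-up rows rather than the real dataset:

```python
import pandas as pd

# Made-up rows for illustration; not taken from Iris.csv
toy = pd.DataFrame({
    "SepalLengthCm": [5.0, 5.0, 5.0, 6.0, 6.0],
    "SepalWidthCm":  [3.0, 3.4, 3.5, 2.2, 3.0],
})

# One mean y value per distinct x value: these are the points the line connects
line_points = toy.groupby("SepalLengthCm")["SepalWidthCm"].mean()
print(line_points)
```

This is why the line can look jagged on data that is not a time series: each point is an average over whichever flowers happen to share that sepal length.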

Scatterplot

Scatter plots are similar to line plots but preferred for static variables (that don’t change over time). A scatter plot draws every data point on a 2D graph without connecting the points.

Scatter plots are used to judge visually how strongly two variables are correlated. We can draw one with the scatterplot function.

# Generate a scatter plot of all the data points
sns.scatterplot(data=data, x='SepalLengthCm', y='SepalWidthCm')
Scatter Plot (Source: Author)
# We can also distinguish the points by their species
sns.scatterplot(data=data, x='SepalLengthCm', y='SepalWidthCm', hue='Species')
Scatter Plots for all the categories (Source: Author)
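To put a number on what a scatter plot suggests, we can compute the Pearson correlation coefficient of the two variables. The sketch below works it out by hand with NumPy and compares it with np.corrcoef. The values are made up for illustration and are not from the Iris dataset.

```python
import numpy as np

# Made-up measurements for illustration; not taken from Iris.csv
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson correlation: how much x and y vary together,
# scaled by how much each varies on its own
dx = x - x.mean()
dy = y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r, 3))               # 0.8
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in computes the same quantity
```

A value close to 1 or -1 corresponds to points that lie near a straight line; a value near 0 corresponds to a shapeless cloud.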

Histogram

One of the most important plots, a histogram can be used for a variety of purposes, including the detection of outliers, skewness, and variance in the data. Each bar shows the frequency/number of data points falling under a particular range on the x-axis. We can plot a histogram using the histplot function.

# Plot a histogram of Sepal Length for all species together
sns.histplot(data=data, x='SepalLengthCm')
Histogram (Source: Author)
# Plot a histogram for each species
sns.histplot(data=data, x='SepalLengthCm', hue='Species')
Histogram for individual species (Source: Author)
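The bar heights are simply counts per bin. As a rough sketch of what happens behind the plot, np.histogram can do the same counting on a handful of made-up values (the bin edges here are chosen by hand for readability; histplot picks its bins automatically unless you pass the bins parameter):

```python
import numpy as np

# Made-up sepal lengths in cm; not taken from Iris.csv
sepal_lengths = np.array([4.6, 4.9, 5.0, 5.1, 5.4, 5.8, 6.3, 6.7, 7.1])

# Hand-picked bin edges: [4, 5), [5, 6), [6, 7), [7, 8]
bin_edges = [4, 5, 6, 7, 8]

counts, edges = np.histogram(sepal_lengths, bins=bin_edges)
print(counts.tolist())  # [2, 4, 2, 1]
```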

Probability Density Function (PDF)

A probability density function (PDF) describes how likely a random variable is to take values near each point. The probability that the variable falls within a range of values is the area under the curve over that range. Plotting the density helps us judge which distribution a variable roughly follows.

Knowing the approximate distribution can help us choose a suitable machine-learning model for the variable. Since the true density is unknown, we estimate it from the data using kernel density estimation (KDE). We can plot the estimate using the kdeplot function.

# Plot the probability density function of Petal Length for all species together
sns.kdeplot(data=data, x='PetalLengthCm')
Probability Density Function (Source: Author)
# Plot the probability density function of Petal Length for each species
sns.kdeplot(data=data, x='PetalLengthCm', hue='Species')
Probability Density Function for individual species (Source: Author)

We can see from the above plot that Iris-Setosa’s probability density function does not overlap with other species. Iris-Setosa can therefore be distinguished easily from other species.
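To make the idea of kernel density estimation concrete, here is a small NumPy sketch: it places a Gaussian bump on every sample, averages the bumps, and checks that the area under the result is about 1. The sample values and the bandwidth of 0.3 are made up for illustration; kdeplot chooses its own bandwidth.

```python
import numpy as np

# Made-up petal lengths in cm (two loose clusters); not taken from Iris.csv
samples = np.array([1.3, 1.4, 1.5, 1.4, 1.7, 4.5, 4.7, 5.1, 5.9, 6.0])
bandwidth = 0.3  # hand-picked for this sketch

# Evaluate the estimate on a fine, evenly spaced grid
grid = np.linspace(-2.0, 10.0, 2401)
step = grid[1] - grid[0]

# One Gaussian bump per sample, averaged over all samples
z = (grid[:, None] - samples[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (
    len(samples) * bandwidth * np.sqrt(2 * np.pi)
)

# Area under the whole curve (should be close to 1)
total_area = (density * step).sum()

# Estimated probability of a value between 1 cm and 2 cm:
# the area under the curve over that range
in_range = (grid >= 1.0) & (grid <= 2.0)
prob_1_to_2 = (density[in_range] * step).sum()

print(round(total_area, 3), round(prob_1_to_2, 3))
```

The height of the curve is a density, not a probability; only areas under it can be read as probabilities.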

Boxplot

A box plot gives us five insights about the data:

  • Lower Extreme: the lowest value in the dataset that is not flagged as an outlier
  • Upper Extreme: the highest value in the dataset that is not flagged as an outlier

The Lower Extreme and Upper Extreme are useful in detecting outliers. By default, Seaborn extends the whiskers to the furthest points within 1.5 times the interquartile range of the box and draws anything beyond them as individual points.

  • Upper Quartile: the 75th percentile of the dataset, which is the value below which 75 percent of the data falls (when in ascending order).
  • Lower Quartile: the 25th percentile of the dataset, which is the value below which 25 percent of the data falls (when in ascending order).

The box spans the Lower Quartile to the Upper Quartile; its height is called the interquartile range (IQR).

  • Median: the middle value of the dataset

See the diagram of the box plot below for a clear understanding.

Box Plot (Source: Author)
# Plot a box plot of Petal Length for each species
sns.boxplot(data=data, x='Species', y='PetalLengthCm')
Box plot for individual species (Source: Author)
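The numbers behind a box plot can be computed directly. The sketch below uses NumPy on made-up values and applies the common 1.5 × IQR rule (Seaborn's default whisker setting) to separate the whisker ends from the outliers.

```python
import numpy as np

# Made-up petal lengths in cm; not taken from Iris.csv
values = np.array([1.0, 1.3, 1.4, 1.5, 1.5, 1.6, 1.7, 1.9, 3.0])

# Lower quartile, median, and upper quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond these fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)              # the box and its middle line
print(inside.min(), inside.max())  # where the whiskers end
print(outliers.tolist())           # [3.0]
```

Note that the upper whisker stops at 1.9, not at the maximum of 3.0, because 3.0 lies beyond the upper fence.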

Violinplot

Violin plots are similar to box plots but also show the variable’s probability density function. This gives the plot the appearance of a violin. You can plot it using the violinplot function.

See the diagram of the violin plot below for a clear understanding.

Violin Plot (Source: Author)
# Plot a violin plot of Petal Length for each species
sns.violinplot(data=data, x='Species', y='PetalLengthCm')
Violin plot for individual species (Source: Author)

Pairplot

A pair plot draws pairwise scatter plots for all numerical variables. On the diagonal, where the x-axis and y-axis show the same variable, we get the histogram/probability density function of that variable instead.

Since it graphs all the plots together, it becomes very easy to analyze every feature and its correlation with other features.

The use of pair plots is not recommended when the dataset contains a large number of features because they require a considerable amount of time to plot. You can graph a pair plot using the pairplot function.

# Plot the pair plot
sns.pairplot(data=data)
Pair Plot (Source: Author)
# Plot the pair plot with each species in its own color
sns.pairplot(data=data, hue='Species')
Pair plot for individual species (Source: Author)

Heatmap

Heatmaps are generally used to study the correlation between numerical variables. Each cell is given a color that represents the correlation between the two variables.

With the GnBu colormap used below, darker colors indicate a higher (more positive) correlation between the variables, whereas lighter colors indicate a low or negative correlation. You can plot a heatmap using the heatmap function.

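# Plot a heatmap of the pairwise correlations
# numeric_only=True skips the text column Species (pandas 2.0+ raises an error without it)
sns.heatmap(data=data.corr(numeric_only=True), annot=True, cmap="GnBu")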
Heatmap (Source: Author)

From the heatmap, we come to know that the pairs that have a high correlation are:

  • Petal Length and Sepal Length
  • Sepal Length and Petal Width
  • Petal Length and Petal Width

In addition, we observe that all diagonal elements have a correlation of 1. This is because each diagonal cell compares a variable with itself, which is always a perfect correlation.
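The matrix that the heatmap colors is an ordinary pandas DataFrame, so it can be inspected directly. The sketch below builds it for a small made-up table; selecting the numerical columns first with select_dtypes is one way to keep a text column such as Species out of the calculation.

```python
import pandas as pd

# Made-up rows for illustration; not taken from Iris.csv
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.3, 5.8, 7.1],
    "PetalLengthCm": [1.4, 1.4, 4.7, 5.1, 5.9],
    "Species": ["a", "a", "b", "c", "c"],  # placeholder labels
})

# Keep only the numerical columns, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

# Diagonal cells are 1; the matrix is symmetric around the diagonal
print(corr.round(2))
```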

Jointplot

In a joint plot of two variables (x and y), we plot the scatter plot and the histogram/probability density function of the two variables. The joint plot can be used for both univariate and bivariate analyses.

Joint plot = Scatter Plot + Histogram/Probability density function of the two variables.

# Plot a joint plot of Petal Length and Sepal Width
sns.jointplot(data=data, x='PetalLengthCm', y='SepalWidthCm')
Joint Plot (Source: Author)
# Plot the joint plot with each species in its own color
sns.jointplot(data=data, x='PetalLengthCm', y='SepalWidthCm', hue='Species')
Joint Plot for individual species (Source: Author)

Rugplot

Similar to the joint plot, the rug plot can be used for both univariate and bivariate analyses. A rug plot draws a small tick along the axis for every observation, which shows the marginal distribution of the variables on the x- and y-axes. Rug plots can be drawn for one variable or for two.

# Plot a rug plot of Petal Length only, colored by species
sns.rugplot(data=data, x='PetalLengthCm', hue='Species')
Rug plot for a single variable (Source: Author)
# Plot a rug plot of Petal Length and Sepal Length, colored by species
sns.rugplot(data=data, x='PetalLengthCm', y='SepalLengthCm', hue='Species')
Rug plot for two variables (Source: Author)

Conclusion

For univariate analysis, we can use

  • Box plot
  • Violin plot
  • Probability Density function
  • Histogram
  • Joint plot
  • Rug plot

For bivariate analysis, we can use

  • Line plot
  • Pair plot
  • Scatter plot
  • Joint plot
  • Heatmap
  • Rug plot

👋 Greetings!

Thanks for sticking around for the rest of the blog! I hope you had a great time!

I cover all kinds of Data Science & AI stuff… and sometimes Programming.
